docs: glitchtip runbook + setup + RFC §3.1 dev amendment

Operational docs for the GlitchTip stack landed in the previous two commits.

- dev-docs/GLITCHTIP.md: new runbook covering local dev, project provisioning + DSN-to-vault flow, production deploy on monitoring.hausdesign.nl (DNS, DirectAdmin Let's Encrypt, Apache reverse proxy with WS upgrade), backup install + restore drill, smoke tests, troubleshooting.
- dev-docs/SETUP.md: services table now includes GlitchTip; new docker/glitchtip/.env subsection points at the runbook.
- dev-docs/RFC-WS-7-OBSERVABILITY.md §3.1: amended to record that the same compose file drives local dev (Mailpit at bm_mailpit:1025), so prod and dev cannot drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
283
dev-docs/GLITCHTIP.md
Normal file
@@ -0,0 +1,283 @@
# GlitchTip — operations runbook

Self-hosted error tracking for Crewli. GlitchTip implements the Sentry
event protocol; the official Sentry SDKs (`sentry-laravel`, `@sentry/vue`,
`@sentry/cli`) work against it without modification.

Reference: [`RFC-WS-7-OBSERVABILITY.md`](./RFC-WS-7-OBSERVABILITY.md).

This file documents how to run the stack, both locally and on the production
monitoring host. PR-2 (backend SDK) and PR-3 (frontend SDK) consume DSNs
provisioned via the steps below.

---

## 1. Overview

| Service | Image | Role |
|---------|-------|------|
| `glitchtip-web` | `glitchtip/glitchtip:6.1.6` | Django web UI + ingest API |
| `glitchtip-worker` | `glitchtip/glitchtip:6.1.6` | Celery worker + beat (event processing, alerts, partition maintenance) |
| `glitchtip-postgres` | `postgres:16-alpine` | Primary datastore |
| `glitchtip-redis` | `valkey/valkey:7-alpine` | Celery broker + cache |

The same `docker-compose.glitchtip.yml` runs both locally (merged with
`docker-compose.yml`) and on the production host (standalone). Container
names are identical in both environments to avoid configuration drift.

---

## 2. Local development

```bash
# Once
cp docker/glitchtip/.env.example docker/glitchtip/.env

# Boot the full stack (MySQL, Redis, Mailpit, GlitchTip)
make services

# First boot takes ~60s while migrations run. Tail progress:
make services-glitchtip-status
```

Web UI: <http://localhost:8200>. Outbound mail goes to Mailpit
(`http://localhost:8025`).

Create the first admin user:

```bash
docker exec -it glitchtip-web ./manage.py createsuperuser
```

Stop the stack with `make services-stop`. Volumes (`glitchtip_postgres_data`,
`glitchtip_redis_data`, `glitchtip_uploads`) survive a stop. Wipe with
`docker compose -f docker-compose.yml -f docker-compose.glitchtip.yml down -v`,
but **never on production**.

---

## 3. Project provisioning

Once the web UI is reachable and the superuser exists:

1. Sign in at `/`.
2. Create an Organization called **Crewli**.
3. Create two projects:
   - **`crewli-api`**: platform PHP / Laravel, alert rules: default.
   - **`crewli-app`**: platform JavaScript / Vue, alert rules: default.
4. For each project, copy the auto-generated DSN from
   *Settings → Client Keys (DSN)*.
5. Store both DSNs in 1Password under `Crewli / GlitchTip / DSNs`:
   - `SENTRY_DSN_BACKEND` ← `crewli-api` DSN
   - `SENTRY_DSN_FRONTEND` ← `crewli-app` DSN

PR-2 wires `SENTRY_DSN_BACKEND` into `api/.env.example`; PR-3 wires
`SENTRY_DSN_FRONTEND` into `apps/app/.env.example`. An empty DSN turns the
SDK into a no-op (verified for both `sentry-laravel` and `@sentry/vue`), so
dev environments without a DSN are silent.
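
As a rough illustration of the "empty DSN = no-op" rule, the sketch below classifies a DSN string before it is pasted into an `.env` file. It assumes the usual Sentry-style shape `scheme://<public_key>@<host>[:port]/<numeric_project_id>`; the function name `dsn_check` and the sample values are hypothetical and not part of the repo.

```shell
# Hypothetical helper, not part of the repo: classify a DSN string.
# Returns 0 for a well-formed DSN, 1 for an empty one (SDK stays a no-op),
# 2 for anything else. The accepted shape is an assumption, see lead-in.
dsn_check() {
  dsn="$1"
  if [ -z "$dsn" ]; then
    return 1
  fi
  if printf '%s\n' "$dsn" \
    | grep -Eq '^https?://[A-Za-z0-9]+@[A-Za-z0-9.-]+(:[0-9]+)?/[0-9]+$'; then
    return 0
  fi
  return 2
}

# Example (made-up key): dsn_check "https://abc123@monitoring.hausdesign.nl/1"
```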

---

## 4. Production deployment

GlitchTip runs on a separate host (`monitoring.hausdesign.nl`) and is **not**
deployed via the Crewli `deploy.sh` pipeline.

### 4.1 Prerequisites

- Docker + Docker Compose v2 on the monitoring host.
- DirectAdmin with the Let's Encrypt module enabled.
- DNS A-record `monitoring.hausdesign.nl` pointing at the host IP.

### 4.2 Place the stack

```bash
sudo install -d -o crewli -g crewli /opt/glitchtip
sudo install -d -o crewli -g crewli /opt/glitchtip/docker/glitchtip

# Copy compose file + env example to the host (e.g. via scp or git checkout).
#   /opt/glitchtip/docker-compose.glitchtip.yml
#   /opt/glitchtip/docker/glitchtip/.env.example
```

### 4.3 Configure `.env`

```bash
cd /opt/glitchtip
cp docker/glitchtip/.env.example docker/glitchtip/.env
chmod 0600 docker/glitchtip/.env
```

Fill in the production values (the header of `.env.example` lists the
checklist):

```env
SECRET_KEY=<python -c "import secrets; print(secrets.token_urlsafe(50))">
DATABASE_URL=postgres://postgres:<STRONG>@glitchtip-postgres:5432/glitchtip
POSTGRES_PASSWORD=<STRONG>          # MUST match the password in DATABASE_URL
GLITCHTIP_DOMAIN=https://monitoring.hausdesign.nl
DEFAULT_FROM_EMAIL=glitchtip@hausdesign.nl
EMAIL_URL=smtp+tls://USER:PASSWORD@HOST:PORT
```

Source the `<STRONG>` password from the 1Password vault.
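
Because a mismatch between the two password fields is easy to introduce by hand, a small pre-flight check can help. The following is only a sketch: `env_passwords_match` is a hypothetical helper, and it assumes the password contains no spaces or `#` and is not percent-encoded in the URL.

```shell
# Hypothetical pre-flight check, not part of the repo.
# Succeeds only when the password embedded in DATABASE_URL equals
# POSTGRES_PASSWORD. Assumes no spaces, '#', or percent-encoding in the password.
env_passwords_match() {
  env_file="$1"
  # Value after '=' up to the first space or '#', so inline comments are ignored.
  pg_pass=$(sed -n 's/^POSTGRES_PASSWORD=\([^ #]*\).*/\1/p' "$env_file")
  # Text between "user:" and the last '@' in the URL.
  url_pass=$(sed -n 's|^DATABASE_URL=[a-z]*://[^:]*:\(.*\)@[^@]*$|\1|p' "$env_file")
  [ -n "$pg_pass" ] && [ "$pg_pass" = "$url_pass" ]
}
```

A possible use before first boot: `env_passwords_match docker/glitchtip/.env || echo "password mismatch"`.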

### 4.4 DNS + TLS

1. Create the A-record for `monitoring.hausdesign.nl` in DNS.
2. In DirectAdmin: add the subdomain, then enable Let's Encrypt
   (Domain Setup → SSL Certificates → "Free & automatic certificate from
   Let's Encrypt"). Wait for the cert to issue.

### 4.5 Apache reverse proxy

DirectAdmin generates the vhost. Add a custom config (DirectAdmin →
Custom HTTPD Configurations) for the `monitoring.hausdesign.nl` HTTPS
vhost:

```apache
ProxyPreserveHost On
ProxyRequests Off
ProxyPass / http://127.0.0.1:8200/
ProxyPassReverse / http://127.0.0.1:8200/

# WebSocket upgrade: GlitchTip uses WS for live event streaming.
RewriteEngine On
RewriteCond %{HTTP:Upgrade} websocket [NC]
RewriteCond %{HTTP:Connection} upgrade [NC]
RewriteRule ^/?(.*) "ws://127.0.0.1:8200/$1" [P,L]
```

Reload Apache. These directives need `mod_proxy`, `mod_proxy_http`,
`mod_proxy_wstunnel`, and `mod_rewrite` to be loaded.

### 4.6 First boot

```bash
cd /opt/glitchtip
docker compose -f docker-compose.glitchtip.yml up -d

# Wait for healthchecks (~60s).
docker compose -f docker-compose.glitchtip.yml ps

# Create the admin user.
docker exec -it glitchtip-web ./manage.py createsuperuser
```

Open <https://monitoring.hausdesign.nl>, sign in, and **enable 2FA** on
the account immediately (acceptance criterion 1): Profile → Security →
Two-Factor Authentication.

Then provision the two projects (§3) and capture DSNs into 1Password.

---

## 5. Backup & restore

### 5.1 Daily backup

`scripts/glitchtip-backup.sh` runs `pg_dump --format=custom`, streams it
through gzip, writes to `./backups/glitchtip/glitchtip-<ts>.dump.gz` with
`0600` permissions, and prunes dumps older than 30 days.

Install the cron entry on the production host:

```cron
# /etc/cron.d/glitchtip-backup
0 3 * * * crewli /opt/crewli/scripts/glitchtip-backup.sh >> /var/log/glitchtip-backup.log 2>&1
```

(Replace `/opt/crewli` with wherever the Crewli repo checkout lives on
the monitoring host. The script is portable: only the `docker exec`
target container needs to exist.)

The script exits non-zero on dump failure. Note that cron mails `MAILTO`
only when a job produces output, and the entry above redirects all output
to the log file, so a failed run generates no mail as written. To have
failures mailed, let the script's error output reach cron instead of the
log, or monitor the log file.
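
The 30-day pruning step described above can be pictured roughly as follows. This is a sketch only: `prune_old_dumps` is a hypothetical name, and the real `scripts/glitchtip-backup.sh` may implement retention differently.

```shell
# Hypothetical retention helper; NOT the actual backup script.
# Deletes dumps in the given directory whose age exceeds the given number of
# whole days (default 30). Only files matching the dump naming scheme are touched.
prune_old_dumps() {
  dir="$1"
  days="${2:-30}"
  # -mtime +N matches files whose age in whole days is greater than N.
  find "$dir" -maxdepth 1 -type f -name 'glitchtip-*.dump.gz' \
    -mtime +"$days" -exec rm -f {} +
}
```

For example, `prune_old_dumps ./backups/glitchtip 30` would leave recent dumps untouched and remove only older ones.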

### 5.2 Restore drill

```bash
# Pick the dump to restore from.
DUMP=./backups/glitchtip/glitchtip-20260506-030000.dump.gz

# Stream the restore into the postgres container.
gunzip < "$DUMP" \
  | docker exec -i glitchtip-postgres pg_restore \
      -U postgres -d glitchtip --clean --if-exists
```

`--clean --if-exists` drops each object contained in the dump before
recreating it, so those objects end up as they were at dump time. Objects
created after the dump and absent from it are left in place. Run the
restore after a `docker compose stop glitchtip-web glitchtip-worker` (with
the same `-f` flags used to start the stack) to avoid concurrent writes
during the restore.

Bert should drill the restore at least once after the production stack
is live (acceptance criterion 11).
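
For the drill, picking the newest dump and checking that it is a readable gzip stream can be scripted. The helper below is a sketch under the assumption that all dumps follow the `glitchtip-<YYYYMMDD>-<HHMMSS>.dump.gz` naming seen above; `latest_dump` is a hypothetical name, not something the backup script provides.

```shell
# Hypothetical helper, not part of scripts/glitchtip-backup.sh.
# Prints the newest dump in a directory, relying on the sortable timestamp
# in the filename. Returns 1 if no dump exists, 2 if the newest one is corrupt.
latest_dump() {
  dir="$1"
  newest=""
  for f in "$dir"/glitchtip-*.dump.gz; do
    [ -e "$f" ] || continue   # the glob matched nothing
    newest="$f"               # glob results are sorted, so the last one wins
  done
  [ -n "$newest" ] || return 1
  # gzip -t verifies the compressed stream without writing any output.
  gzip -t "$newest" || return 2
  printf '%s\n' "$newest"
}
```

One way to use it: `DUMP=$(latest_dump ./backups/glitchtip)` before running the restore pipeline.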

---

## 6. Monitoring the monitor

Quick smoke tests:

```bash
# API responds with JSON (not 502).
curl -sS http://localhost:8200/api/0/

# Worker reporting in (look for "celery@... ready").
docker compose -f docker-compose.yml -f docker-compose.glitchtip.yml \
  logs --tail=50 glitchtip-worker

# All services healthy.
docker compose -f docker-compose.yml -f docker-compose.glitchtip.yml ps
```

In production, replace `http://localhost:8200` with
`https://monitoring.hausdesign.nl` and pass only
`-f docker-compose.glitchtip.yml` to `docker compose`.
Email alerting is configured in PR-4; until then alerts surface only in
the GlitchTip web UI (Issues view).
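
The first smoke test can be wrapped so that it yields an exit code suitable for scripts or an external uptime check. This is a minimal sketch: `api_is_up` is a hypothetical name, and treating every non-5xx status as "up" is an assumption derived from the "not 502" criterion above.

```shell
# Hypothetical wrapper around the curl smoke test; not part of the repo.
# Succeeds when the API root answers with anything other than a 5xx status.
api_is_up() {
  base_url="$1"
  # -o /dev/null discards the body; -w prints only the HTTP status code.
  status=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 \
    "$base_url/api/0/" 2>/dev/null) || return 1   # connection failed
  case "$status" in
    5??) return 1 ;;   # proxy answered but the backend did not (e.g. 502)
    *)   return 0 ;;
  esac
}

# Example: api_is_up https://monitoring.hausdesign.nl && echo "GlitchTip reachable"
```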

---

## 7. Troubleshooting

### Web container unhealthy on first boot

Migrations take ~60s on a fresh volume. The healthcheck `start_period`
is set accordingly. If the container is still unhealthy after two
minutes, tail logs:

```bash
docker logs glitchtip-web
```

Most common cause: `DATABASE_URL` password ≠ `POSTGRES_PASSWORD`. The
postgres container creates the user with the password it sees when the
data volume is first initialized, while GlitchTip authenticates with the
password embedded in the URL, so they MUST match. Changing
`POSTGRES_PASSWORD` later has no effect on an existing volume.

### Worker idle / events stuck in queue

Check that `REDIS_URL` resolves and the worker is connected:

```bash
# 2>&1 is needed because docker logs replays the container's stderr on stderr.
docker logs glitchtip-worker 2>&1 | grep -E "ready|connected|error"
```

### Volume permission errors on Linux hosts

`postgres:16-alpine` runs as UID 70 internally. If `/var/lib/postgresql/data`
is bind-mounted from the host with mismatched ownership, postgres refuses
to start. The default named volume avoids this; it only becomes relevant
if you later switch to a host bind-mount.

### Right-to-erasure (GDPR Art. 17)

Currently manual. Locate events for a user ULID via the web UI search, then
delete them via the UI or directly on the postgres container. An automated
erasure script is on the BACKLOG (per RFC §4).

---

## 8. References

- RFC: [`RFC-WS-7-OBSERVABILITY.md`](./RFC-WS-7-OBSERVABILITY.md)
- GlitchTip docs: <https://glitchtip.com/documentation>
- GlitchTip self-hosting: <https://glitchtip.com/documentation/install>
@@ -30,6 +30,8 @@ Twee afwijkingen van charter §3 besluit 8, beide bewust:

Self-hosted GlitchTip on the production VPS via Docker Compose (`glitchtip-web`, `glitchtip-worker`, `glitchtip-postgres`, `glitchtip-redis`). Reverse proxy via DirectAdmin Apache; SSL via DirectAdmin Let's Encrypt on `monitoring.hausdesign.nl` (consistent with the existing subdomain pattern).

**Local development:** the same `docker-compose.glitchtip.yml` runs locally via `make services` (combined with the existing `docker-compose.yml` via `-f`). Web UI at `http://localhost:8200`, email to Mailpit at `bm_mailpit:1025`. The dev and prod stacks share one compose file, which rules out configuration drift.

### 3.2 Two projects / DSNs

- `crewli-api`: Laravel

@@ -70,11 +70,18 @@ Three terminal tabs, plus an optional fourth for the queue worker:

| Terminal | Command | Where it runs | Port |
|----------|---------|---------------|------|
| 1. Services | `make services` (from repo root) | Docker | 3306 (MySQL), 6379 (Redis), 8025 (Mailpit) |
| 1. Services | `make services` (from repo root) | Docker | 3306 (MySQL), 6379 (Redis), 8025 (Mailpit), 8200 (GlitchTip) |
| 2. API | `make api` (from repo root) | Laravel dev server | 8000 |
| 3. SPA | `make app` (from repo root) | Vite dev server | 5174 |
| 4. Queue worker (optional) | `cd api && php artisan queue:listen redis --queue=emails` | Local PHP | n/a |

Web UIs available once `make services` is up:

| Service | URL |
|---------|-----|
| Mailpit | <http://localhost:8025> |
| GlitchTip | <http://localhost:8200> (admin UI; first boot ~60s while migrations run) |

The queue worker is only needed when you're triggering email flows (registration, password reset, email change, invitations). Routine UI work doesn't require it.

Stop services when done: `make services-stop`.
@@ -116,6 +123,13 @@ VITE_APP_NAME="Crewli"

For production: `VITE_API_URL=https://api.crewli.app`.

### `docker/glitchtip/.env`

Generated by copying `docker/glitchtip/.env.example`. Dev defaults are
functional out of the box, with no edits needed for `make services`. See
[`GLITCHTIP.md`](./GLITCHTIP.md) for first-boot steps (creating the
superuser, creating the two projects, copying DSNs to 1Password).

## Common tasks

### Run tests
