docs: ARCH-OBSERVABILITY skeleton + $dontReport concrete (WS-6)

Initial observability architecture document. Skeleton with §3
($dontReport exception list) as the only concrete section. Other
sections are structured placeholders for WS-7 session 1 decisions:

- §1 Logging strategy (log levels, criteria)
- §2 Sentry decisions (SDK config, sample rates, breadcrumbs,
  release tagging)
- §3 $dontReport exceptions (concrete): three classes that are
  expected business outcomes, not bugs:
  * PublishGuardViolationException (422 publish-time)
  * PurposeRequirementsNotMetException (422)
  * IdempotencyConflictException (409)
  With explicit out-of-scope rationale for the three runtime
  pipeline exceptions that DO go to Sentry (PersonProvisioning /
  PurposeSubjectResolution / FormBindingApplicator): engineering
  needs cross-org visibility into systemic patterns even when
  org admins handle individual failures via the WS-6 admin UI.
- §4 Structured logging conventions (key naming tree)
- §5 Metrics (counters, histograms)
- §6 Alerting rules (thresholds, routing)
- §7 Dashboards (panel layout)

The skeleton ensures WS-7 starts from a clear scope; the concrete
$dontReport list closes a real Sentry-noise gap immediately
(PublishGuardViolationException etc. should never reach Sentry).

RFC-WS-6.md §9 cross-references the new doc and adds an
Observability follow-up row.

Refs: WS-6 session 3b Task 5, WS-7 (forward)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
121
dev-docs/ARCH-OBSERVABILITY.md
Normal file
@@ -0,0 +1,121 @@
# ARCH-OBSERVABILITY

> Crewli's observability architecture: logging, monitoring, alerting,
> metrics.
>
> Status: SKELETON. §3 (`$dontReport`) is concrete; other
> sections are structured placeholders for WS-7 session 1 decisions.

## Document history

- 2026-04-28, v0.1: Initial skeleton (WS-6 session 3b). Only §3 is
  concrete; the remainder is left as placeholders for WS-7.

## §1 — Logging strategy

[WS-7: define log levels with explicit criteria. Example questions to
answer in WS-7 session 1:

- When does code use `Log::error` vs `Log::warning` vs `Log::info`?
- Are unhandled exceptions automatically `error`?
- Is `Log::debug` allowed in production, or stripped in deploy?
- How do structured payload conventions tie to log keys (see §4)?
]

## §2 — Sentry decisions

[WS-7: Sentry SDK install + configuration decisions. Skeleton:

- Which environments report to Sentry? (dev / staging / production)
- Sample rate per environment?
- Source map upload to Sentry CI?
- User context injection (auth user ID + organisation ID, opt-in
  redaction for PII)?
- Breadcrumbs strategy (which events generate breadcrumbs)?
- Release tagging convention (commit SHA? semver? both?)?
]

## §3 — `$dontReport` exceptions (concrete)

The following exception classes are **expected business outcomes**,
not bugs. They are caught and handled in the application; reporting
them to Sentry would generate noise that drowns out the signal.

When the Sentry SDK lands (WS-7), add the following classes to
the `$dontReport` array in Laravel's `app/Exceptions/Handler.php`:

| Class | Reason |
|---|---|
| `\App\Exceptions\FormBuilder\PublishGuardViolationException` | Publish-time validation: schema fails a guard. Returned as 422 with field-level errors. Not a system bug. |
| `\App\Exceptions\FormBuilder\PurposeRequirementsNotMetException` | Schema lacks required bindings for its purpose. Returned as 422. Not a system bug. |
| `\App\Exceptions\FormBuilder\IdempotencyConflictException` | Duplicate idempotency key on submission. Returned as 409. Not a system bug. |
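
As a minimal sketch of that registration, assuming the application still uses the class-based handler at `app/Exceptions/Handler.php` (the default in Laravel 10 and earlier), the three classes from the table go into the `$dontReport` property. Any other members the real handler defines are omitted here.

```php
<?php

namespace App\Exceptions;

use App\Exceptions\FormBuilder\IdempotencyConflictException;
use App\Exceptions\FormBuilder\PublishGuardViolationException;
use App\Exceptions\FormBuilder\PurposeRequirementsNotMetException;
use Illuminate\Foundation\Exceptions\Handler as ExceptionHandler;

class Handler extends ExceptionHandler
{
    /**
     * Expected business outcomes (422 / 409 responses) that are never reported.
     *
     * @var array<int, class-string<\Throwable>>
     */
    protected $dontReport = [
        PublishGuardViolationException::class,
        PurposeRequirementsNotMetException::class,
        IdempotencyConflictException::class,
    ];
}
```

Note that `$dontReport` skips reporting altogether, so these exceptions are also kept out of the application log, not only out of Sentry.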

**Out of scope for `$dontReport` (these DO go to Sentry):**

- `App\Exceptions\FormBuilder\PersonProvisioningException`: runtime
  failure during the apply pipeline. Caught by
  `ApplyBindingsOnFormSubmit` and recorded as
  `FormSubmissionActionFailure`, but the engineering team needs
  visibility into recurring patterns across orgs.
- `App\Exceptions\FormBuilder\PurposeSubjectResolutionException`:
  runtime resolution failure (no portal token, no auth user, etc.).
  Same dual-handling rationale: action-failures table for
  org-admin operational handling; Sentry for engineering visibility.
- `App\Exceptions\FormBuilder\FormBindingApplicatorException`:
  runtime applicator failure (`no_transaction`, `no_schema`,
  `unknown_purpose`). These should never happen in production; if they
  do, they're systemic bugs, so Sentry is the correct destination.

The dual recording (Sentry + `form_submission_action_failures` table)
is intentional: org admins fix specific failures via the WS-6 admin
UI; engineering identifies systemic issues across all orgs via
Sentry's aggregation.
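
One consequence of the dual recording is worth sketching. In Laravel, an exception that is caught inside a `try`/`catch` does not reach the exception handler on its own, so a listener that records the failure presumably also has to forward the exception explicitly (for example through Laravel's `report()` helper). The framework-free PHP below is a hedged illustration of that flow, not the actual `ApplyBindingsOnFormSubmit` code: the function `applyWithDualRecording`, its callable parameters, and the stub exception classes are all hypothetical.

```php
<?php

declare(strict_types=1);

// Stand-ins for the three runtime pipeline exceptions named above.
// The real classes live under App\Exceptions\FormBuilder.
final class PersonProvisioningException extends RuntimeException {}
final class PurposeSubjectResolutionException extends RuntimeException {}
final class FormBindingApplicatorException extends RuntimeException {}

/**
 * Runs the apply step and, on a pipeline failure, records it twice.
 *
 * @param callable(): void          $apply         Hypothetical apply step.
 * @param callable(Throwable): void $recordFailure Hypothetical writer for the failures table.
 * @param callable(Throwable): void $report        Forwarder to Sentry; in Laravel this could wrap report().
 *
 * @return bool True when the apply step succeeded, false when a failure was recorded.
 */
function applyWithDualRecording(callable $apply, callable $recordFailure, callable $report): bool
{
    try {
        $apply();

        return true;
    } catch (PersonProvisioningException | PurposeSubjectResolutionException | FormBindingApplicatorException $e) {
        // Operational record: lets an org admin handle this specific failure.
        $recordFailure($e);

        // Engineering record: a caught exception is not reported automatically.
        $report($e);

        return false;
    }
}
```

Any other exception type is not caught here and propagates to the caller unchanged.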

## §4 — Structured logging conventions

[WS-7: log key naming convention. Skeleton:

- Hierarchical dot-separated namespace tree
- Existing examples to align with:
  - `form-builder.apply.transaction_rolled_back`
  - `form-builder.identity-match.no_person_subject_post_apply`
  - `form-webhook.delivery.exception`

Define the tree formally so future code discovers the right namespace
deterministically.]

## §5 — Metrics

[WS-7: which counters / histograms / gauges? Namespace?
StatsD / Prometheus / OTel flavour? At minimum, candidate metrics:

- `form_submissions_total` (counter, tagged by purpose)
- `form_submission_apply_status` (counter, tagged by status)
- `form_failures_open` (gauge per org)
- `retry_attempts_total` (counter, tagged by outcome)
- `apply_pipeline_duration_seconds` (histogram)
]

## §6 — Alerting rules

[WS-7: which thresholds trigger alerts? Where are they routed (Slack,
PagerDuty, email)? At minimum, candidate alerts:

- "Open failures > X for > Y hours"
- "Apply pipeline error rate > X% in 1h window"
- "no_transaction guard fired" (immediate alert; should never happen
  in production)
- "Webhook dead-letter rate > X%"
]

## §7 — Dashboards

[WS-7: Grafana / CloudWatch / similar. Panel layout, widget types,
default time ranges. Skeleton to be added later.]

## Related docs

- `RFC-WS-6.md`: WS-6 binding pipeline design (the failures observed
  and recorded by §3's classes originate here)
- `ARCH-BINDINGS.md`: apply pipeline architecture
- `ARCH-FORM-BUILDER.md`: form-builder runtime including webhooks

@@ -490,6 +490,13 @@ This document. Sessions 2 and 3 reference RFC sections by number rather than re-
| `LOAD-TEST-FOUNDATION` | Pre-release hardening, separate workstream |
| `FORM-BINDING-SNAPSHOT-MULTI` | When patterns require multi-binding-per-field snapshot shape |
| Daily failure digest | When notification framework lands |
| Observability | Sentry SDK, structured logs, metrics, alerts; see `ARCH-OBSERVABILITY.md` skeleton (session 3b). WS-7 session 1 fills it in. |

Observability strategy for the WS-6 binding pipeline (Sentry
`$dontReport` decisions, log levels, metric names) is documented in
`ARCH-OBSERVABILITY.md`. The skeleton landed in WS-6 session 3b with
§3 (`$dontReport`) concrete; the remaining sections will be filled in
during WS-7 session 1.

## 10. Document history