From ddacf9363e363cf9988addcd1b533c7e1dcf3d01 Mon Sep 17 00:00:00 2001
From: "bert.hausmans"
Date: Tue, 28 Apr 2026 21:52:08 +0200
Subject: [PATCH] docs: ARCH-OBSERVABILITY skeleton + $dontReport concrete (WS-6)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Initial observability architecture document. Skeleton with §3
($dontReport exception list) as the only concrete section. Other
sections are structured placeholders for WS-7 session 1 decisions:

- §1 Logging strategy (log levels, criteria)
- §2 Sentry decisions (SDK config, sample rates, breadcrumbs,
  release tagging)
- §3 $dontReport exceptions (concrete) — three classes that are
  expected business outcomes, not bugs:
  * PublishGuardViolationException (422 publish-time)
  * PurposeRequirementsNotMetException (422)
  * IdempotencyConflictException (409)
  With explicit out-of-scope rationale for the three runtime
  pipeline exceptions that DO go to Sentry (PersonProvisioning /
  PurposeSubjectResolution / FormBindingApplicator) — engineering
  needs cross-org visibility into systemic patterns even when org
  admins handle individual failures via the WS-6 admin UI.
- §4 Structured logging conventions (key naming tree)
- §5 Metrics (counters, histograms)
- §6 Alerting rules (thresholds, routing)
- §7 Dashboards (panel layout)

The skeleton ensures WS-7 starts from a clear scope; the concrete
$dontReport list closes a real Sentry-noise gap immediately
(PublishGuardViolationException etc. should never reach Sentry).

RFC-WS-6.md §9 cross-references the new doc and adds an
Observability follow-up row.
Refs: WS-6 session 3b Task 5, WS-7 (forward)
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 dev-docs/ARCH-OBSERVABILITY.md | 121 +++++++++++++++++++++++++++++++++
 dev-docs/RFC-WS-6.md           |   7 ++
 2 files changed, 128 insertions(+)
 create mode 100644 dev-docs/ARCH-OBSERVABILITY.md

diff --git a/dev-docs/ARCH-OBSERVABILITY.md b/dev-docs/ARCH-OBSERVABILITY.md
new file mode 100644
index 00000000..24b92ac7
--- /dev/null
+++ b/dev-docs/ARCH-OBSERVABILITY.md
@@ -0,0 +1,121 @@
+# ARCH-OBSERVABILITY
+
+> Crewli's observability architecture — logging, monitoring, alerting,
+> metrics.
+>
+> Status: SKELETON. Section §3 (`$dontReport`) is concrete; other
+> sections are structured placeholders for WS-7 session 1 decisions.
+
+## Document history
+
+- 2026-04-28 — v0.1 — Initial skeleton (WS-6 session 3b). Only §3 is
+  concrete; the rest is placeholder text for WS-7.
+
+## §1 — Logging strategy
+
+[WS-7: define log levels with explicit criteria. Example questions to
+answer in WS-7 session 1:
+
+- When does code use `Log::error` vs `Log::warning` vs `Log::info`?
+- Are unhandled exceptions automatically `error`?
+- Is `Log::debug` allowed in production, or stripped at deploy time?
+- How do structured payload conventions tie to log keys (see §4)?
+]
+
+## §2 — Sentry decisions
+
+[WS-7: Sentry SDK install + configuration decisions. Skeleton:
+
+- Which environments report to Sentry? (dev / staging / production)
+- Sample rate per environment?
+- Source map upload to Sentry from CI?
+- User context injection (auth user ID + organisation ID, opt-in
+  redaction for PII)?
+- Breadcrumbs strategy (which events generate breadcrumbs)?
+- Release tagging convention (commit SHA? semver? both?)?
+]
+
+## §3 — `$dontReport` exceptions (concrete)
+
+The following exception classes are **expected business outcomes**,
+not bugs. They are caught and handled in the application; reporting
+them to Sentry would generate noise that drowns out the signal.
+
+When the Sentry SDK lands (WS-7), add the following classes to
+Laravel's `app/Exceptions/Handler.php` `$dontReport` array:
+
+| Class | Reason |
+|---|---|
+| `\App\Exceptions\FormBuilder\PublishGuardViolationException` | Publish-time validation: schema fails a guard. Returned as 422 with field-level errors. Not a system bug. |
+| `\App\Exceptions\FormBuilder\PurposeRequirementsNotMetException` | Schema lacks required bindings for its purpose. Returned as 422. Not a system bug. |
+| `\App\Exceptions\FormBuilder\IdempotencyConflictException` | Duplicate idempotency key on submission. Returned as 409. Not a system bug. |
+
+**Out of scope for `$dontReport` (these DO go to Sentry):**
+
+- `\App\Exceptions\FormBuilder\PersonProvisioningException` — runtime
+  failure during the apply pipeline. Caught by
+  `ApplyBindingsOnFormSubmit` and recorded as
+  `FormSubmissionActionFailure`, but the engineering team needs
+  visibility into recurring patterns across orgs.
+- `\App\Exceptions\FormBuilder\PurposeSubjectResolutionException` —
+  runtime resolution failure (no portal token, no auth user, etc.).
+  Same dual-handling rationale: action-failures table for
+  org-admin operational handling; Sentry for engineering visibility.
+- `\App\Exceptions\FormBuilder\FormBindingApplicatorException` —
+  runtime applicator failure (no_transaction, no_schema,
+  unknown_purpose). These should never happen in production; if they
+  do, they're systemic bugs — Sentry is the correct destination.
+
+The dual recording (Sentry + `form_submission_action_failures` table)
+is intentional: org admins fix specific failures via the WS-6 admin
+UI; engineering identifies systemic issues across all orgs via
+Sentry's aggregation.
+
+## §4 — Structured logging conventions
+
+[WS-7: log key naming convention. Skeleton:
+
+- Hierarchical dot-separated namespace tree
+- Existing examples to align with:
+  - `form-builder.apply.transaction_rolled_back`
+  - `form-builder.identity-match.no_person_subject_post_apply`
+  - `form-webhook.delivery.exception`
+
+Define the tree formally so future code discovers the right namespace
+deterministically.]
+
+## §5 — Metrics
+
+[WS-7: which counters / histograms / gauges? Namespace?
+StatsD / Prometheus / OTel flavour? At minimum, candidate metrics:
+
+- `form_submissions_total` (counter, tagged by purpose)
+- `form_submission_apply_status` (counter, tagged by status)
+- `form_failures_open` (gauge per org)
+- `retry_attempts_total` (counter, tagged by outcome)
+- `apply_pipeline_duration_seconds` (histogram)
+]
+
+## §6 — Alerting rules
+
+[WS-7: which thresholds trigger alerts? Routed where (Slack? PagerDuty?
+Email?)? At minimum, candidate alerts:
+
+- "Open failures > X for > Y hours"
+- "Apply pipeline error rate > X% in 1h window"
+- "no_transaction guard fired" (immediate alert; should never happen
+  in production)
+- "Webhook dead-letter rate > X%"
+]
+
+## §7 — Dashboards
+
+[WS-7: Grafana / CloudWatch / similar. Panel layout, widget types,
+default time ranges. Skeleton to follow.]
+
+## Related docs
+
+- `RFC-WS-6.md` — WS-6 binding pipeline design (the failures observed
+  and recorded by §3's classes originate here)
+- `ARCH-BINDINGS.md` — apply pipeline architecture
+- `ARCH-FORM-BUILDER.md` — form-builder runtime including webhooks
diff --git a/dev-docs/RFC-WS-6.md b/dev-docs/RFC-WS-6.md
index f1239784..a25f97c1 100644
--- a/dev-docs/RFC-WS-6.md
+++ b/dev-docs/RFC-WS-6.md
@@ -490,6 +490,13 @@ This document. Sessions 2 and 3 reference RFC sections by number rather than re-
 | `LOAD-TEST-FOUNDATION` | Pre-release hardening, separate workstream |
 | `FORM-BINDING-SNAPSHOT-MULTI` | When patterns require multi-binding-per-field snapshot shape |
 | Daily failure digest | When notification framework lands |
+| Observability | Sentry SDK, structured logs, metrics, alerts — see `ARCH-OBSERVABILITY.md` skeleton (session 3b). WS-7 session 1 fills it in. |
+
+Observability strategy for the WS-6 binding pipeline (Sentry
+`$dontReport` decisions, log levels, metric names) is documented in
+`ARCH-OBSERVABILITY.md`. The skeleton landed in WS-6 session 3b with
+§3 (`$dontReport`) concrete; the remaining sections will be filled in
+during WS-7 session 1.
 
 ## 10. Document history
 