crewli/dev-docs/ARCH-OBSERVABILITY.md
bert.hausmans ddacf9363e docs: ARCH-OBSERVABILITY skeleton + $dontReport concrete (WS-6)
Initial observability architecture document. Skeleton with §3
($dontReport exception list) as the only concrete section. Other
sections are structured placeholders for WS-7 sessie 1 decisions:

  - §1 Logging strategy (log levels, criteria)
  - §2 Sentry decisions (SDK config, sample rates, breadcrumbs,
    release tagging)
  - §3 $dontReport exceptions (concrete): three classes that are
    expected business outcomes, not bugs:
      * PublishGuardViolationException (422 publish-time)
      * PurposeRequirementsNotMetException (422)
      * IdempotencyConflictException (409)
    With explicit out-of-scope rationale for the three runtime
    pipeline exceptions that DO go to Sentry (PersonProvisioning /
    PurposeSubjectResolution / FormBindingApplicator) — engineering
    needs cross-org visibility into systemic patterns even when
    org admins handle individual failures via the WS-6 admin UI.
  - §4 Structured logging conventions (key naming tree)
  - §5 Metrics (counters, histograms)
  - §6 Alerting rules (thresholds, routing)
  - §7 Dashboards (panel layout)

The skeleton ensures WS-7 starts from a clear scope; the concrete
$dontReport list closes a real Sentry-noise gap immediately
(PublishGuardViolationException etc. should never have hit Sentry).

RFC-WS-6.md §9 cross-references the new doc and adds an
Observability follow-up row.

Refs: WS-6 sessie 3b Task 5, WS-7 (forward)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:14:19 +02:00


ARCH-OBSERVABILITY

Crewli's observability architecture — logging, monitoring, alerting, metrics.

Status: SKELETON. §3 ($dontReport) is concrete; the other sections are structured placeholders for WS-7 sessie 1 decisions.

Document history

  • 2026-04-28 — v0.1 — Initial skeleton (WS-6 sessie 3b). Only §3 concrete; remainder placeholdered for WS-7.

§1 — Logging strategy

[WS-7: define log levels with explicit criteria. Example questions to answer in WS-7 sessie 1:

  • When does code use Log::error vs Log::warning vs Log::info?
  • Are unhandled exceptions automatically logged at error level?
  • Is Log::debug allowed in production, or stripped in deploy?
  • How do structured payload conventions tie to log keys (see §4)? ]

§2 — Sentry decisions

[WS-7: Sentry SDK install + configuration decisions. Skeleton:

  • Which environments report to Sentry? (dev / staging / production)
  • Sample rate per environment?
  • Source map upload to Sentry from CI?
  • User context injection (auth user ID + organisation ID, opt-in redaction for PII)?
  • Breadcrumbs strategy (which events generate breadcrumbs)?
  • Release tagging convention (commit SHA? semver? both?)? ]

§3 — $dontReport exceptions (concrete)

The following exception classes represent expected business outcomes, not bugs. The application catches and handles them; reporting them to Sentry would only add noise that drowns out real signals.

When the Sentry SDK lands (WS-7), add these classes to the $dontReport array in Laravel's app/Exceptions/Handler.php:

| Class | Reason |
| --- | --- |
| \App\Exceptions\FormBuilder\PublishGuardViolationException | Publish-time validation: schema fails a guard. Returned as 422 with field-level errors. Not a system bug. |
| \App\Exceptions\FormBuilder\PurposeRequirementsNotMetException | Schema lacks required bindings for its purpose. Returned as 422. Not a system bug. |
| \App\Exceptions\FormBuilder\IdempotencyConflictException | Duplicate idempotency key on submission. Returned as 409. Not a system bug. |
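
The effect of the list above can be illustrated with a small, framework-free sketch. It does not extend Laravel's real handler; `SketchHandler` and its `shouldReport()` method are hypothetical stand-ins that only mimic the idea that an exception is skipped when it is an instance of a listed class. The exception classes are declared locally (without their App\Exceptions\FormBuilder namespace) purely so the snippet runs on its own.

```php
<?php
// Stand-ins for classes that, in the application, live under
// App\Exceptions\FormBuilder. Declared here only to keep the sketch runnable.
class PublishGuardViolationException extends RuntimeException {}
class PurposeRequirementsNotMetException extends RuntimeException {}
class IdempotencyConflictException extends RuntimeException {}
class PersonProvisioningException extends RuntimeException {}

// Hypothetical handler: NOT Laravel's Handler, just the same filtering idea.
class SketchHandler
{
    /** @var list<class-string<Throwable>> */
    protected array $dontReport = [
        PublishGuardViolationException::class,
        PurposeRequirementsNotMetException::class,
        IdempotencyConflictException::class,
    ];

    /** Returns false when the exception is an instance of a listed class. */
    public function shouldReport(Throwable $e): bool
    {
        foreach ($this->dontReport as $class) {
            if ($e instanceof $class) {
                return false;
            }
        }
        return true;
    }
}

$handler = new SketchHandler();
var_dump($handler->shouldReport(new IdempotencyConflictException('duplicate key'))); // bool(false)
var_dump($handler->shouldReport(new PersonProvisioningException('apply failed')));   // bool(true)
```

In the real app/Exceptions/Handler.php, only the three entries of the `$dontReport` property would be added; the filtering itself is done by the framework.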

Out of scope for $dontReport (these DO go to Sentry):

  • App\Exceptions\FormBuilder\PersonProvisioningException — runtime failure during the apply pipeline. Caught by ApplyBindingsOnFormSubmit and recorded as FormSubmissionActionFailure, but the engineering team needs visibility into recurring patterns across orgs.
  • App\Exceptions\FormBuilder\PurposeSubjectResolutionException — runtime resolution failure (no portal token, no auth user, etc.). Same dual-handling rationale: action-failures table for org-admin operational handling; Sentry for engineering visibility.
  • App\Exceptions\FormBuilder\FormBindingApplicatorException — runtime applicator failure (no_transaction, no_schema, unknown_purpose). These should never happen in production; if they do, they're systemic bugs — Sentry is the correct destination.

The dual recording (Sentry + form_submission_action_failures table) is intentional: org admins fix specific failures via the WS-6 admin UI; engineering identifies systemic issues across all orgs via Sentry's aggregation.
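
A minimal sketch of the dual recording described above, under stated assumptions: `DualRecordingSketch`, its in-memory arrays, and the row column names are all hypothetical. The `$failureRows` array stands in for the form_submission_action_failures table and `$reported` stands in for Sentry. The actual ApplyBindingsOnFormSubmit listener may be structured quite differently.

```php
<?php
// Stand-in for App\Exceptions\FormBuilder\PersonProvisioningException.
class PersonProvisioningException extends RuntimeException {}

final class DualRecordingSketch
{
    /** @var list<array{submission_id: int, exception: string, message: string}> */
    public array $failureRows = [];

    /** @var list<Throwable> */
    public array $reported = [];

    /** Runs $apply; returns true on success, false if a failure was recorded. */
    public function run(int $submissionId, callable $apply): bool
    {
        try {
            $apply();
            return true;
        } catch (PersonProvisioningException $e) {
            // 1. Operational record for org admins (column names are assumptions).
            $this->failureRows[] = [
                'submission_id' => $submissionId,
                'exception'     => get_class($e),
                'message'       => $e->getMessage(),
            ];
            // 2. Engineering visibility. In a Laravel app this step would
            //    typically be a call to the report($e) helper.
            $this->reported[] = $e;
            return false;
        }
    }
}

$sketch = new DualRecordingSketch();
$ok = $sketch->run(42, function (): void {
    throw new PersonProvisioningException('person could not be provisioned');
});
```

Because the exception is caught rather than rethrown, the explicit reporting step is what keeps it visible to engineering. That only works while the class stays out of `$dontReport`.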

§4 — Structured logging conventions

[WS-7: log key naming convention. Skeleton:

  • Hierarchical dot-separated namespace tree
  • Existing examples to align with:
    • form-builder.apply.transaction_rolled_back
    • form-builder.identity-match.no_person_subject_post_apply
    • form-webhook.delivery.exception

Define the tree formally so future code discovers the right namespace deterministically.]
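
As a possible starting point for that formal definition, the shape of the three existing keys can be captured in a single pattern. This is a sketch inferred only from those examples, not a WS-7 decision. The function name `isWellFormedLogKey` and the exact character rules (kebab-case root, lowercase segments with hyphens or underscores) are assumptions.

```php
<?php
// Shape check inferred from the example keys: a kebab-case root followed by
// one or more dot-separated lowercase segments that may contain '-' or '_'.
function isWellFormedLogKey(string $key): bool
{
    $pattern = '/\A[a-z]+(?:-[a-z]+)*(?:\.[a-z]+(?:[-_][a-z]+)*)+\z/';
    return preg_match($pattern, $key) === 1;
}

$examples = [
    'form-builder.apply.transaction_rolled_back',
    'form-builder.identity-match.no_person_subject_post_apply',
    'form-webhook.delivery.exception',
    'FormBuilder.Apply', // rejected by this sketch: upper-case segments
];
foreach ($examples as $key) {
    echo $key, ' => ', isWellFormedLogKey($key) ? 'ok' : 'rejected', PHP_EOL;
}
```

A check like this could run in a unit test over all log keys once the tree is defined, so new keys that drift from the convention are caught early.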

§5 — Metrics

[WS-7: which counters / histograms / gauges? Which namespace? StatsD / Prometheus / OTel flavour? At minimum, candidate metrics:

  • form_submissions_total (counter, tagged by purpose)
  • form_submission_apply_status (counter, tagged by status)
  • form_failures_open (gauge per org)
  • retry_attempts_total (counter, tagged by outcome)
  • apply_pipeline_duration_seconds (histogram) ]
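
To make the "counter, tagged by ..." idea concrete without pre-empting the backend choice, here is a hypothetical in-memory counter. `InMemoryCounters` and the tag value `registration` are invented for illustration; a real implementation would delegate to whichever client WS-7 selects.

```php
<?php
// Hypothetical backend-agnostic counter store; the real backend is undecided.
final class InMemoryCounters
{
    /** @var array<string, int> keyed by "name{tag=value,...}" */
    private array $values = [];

    public function increment(string $name, array $tags = [], int $by = 1): void
    {
        $key = $this->key($name, $tags);
        $this->values[$key] = ($this->values[$key] ?? 0) + $by;
    }

    public function get(string $name, array $tags = []): int
    {
        return $this->values[$this->key($name, $tags)] ?? 0;
    }

    private function key(string $name, array $tags): string
    {
        ksort($tags); // tag order must not create distinct series
        $pairs = [];
        foreach ($tags as $k => $v) {
            $pairs[] = $k . '=' . $v;
        }
        return $name . '{' . implode(',', $pairs) . '}';
    }
}

$counters = new InMemoryCounters();
$counters->increment('form_submissions_total', ['purpose' => 'registration']);
$counters->increment('form_submissions_total', ['purpose' => 'registration']);
$counters->increment('form_submission_apply_status', ['status' => 'failed']);
echo $counters->get('form_submissions_total', ['purpose' => 'registration']), PHP_EOL; // 2
```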

§6 — Alerting rules

[WS-7: which thresholds trigger alerts, and where are they routed (Slack, PagerDuty, email)? At minimum, candidate alerts:

  • "Open failures > X for > Y hours"
  • "Apply pipeline error rate > X% in 1h window"
  • "no_transaction guard fired" (immediate alert; should never happen in production)
  • "Webhook dead-letter rate > X%" ]
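
The error-rate candidate above could be evaluated along these lines. This is a sketch only: the function name `applyErrorRateExceeds` is hypothetical, the threshold stays a parameter because X is still undecided, and the numbers in the example are made up.

```php
<?php
// Hypothetical evaluation of "apply pipeline error rate > X% in 1h window".
// $failed and $total are the counts observed within the window.
function applyErrorRateExceeds(int $failed, int $total, float $thresholdPercent): bool
{
    if ($total === 0) {
        return false; // no traffic in the window: nothing to alert on
    }
    return ($failed / $total) * 100.0 > $thresholdPercent;
}

// Made-up counts and a made-up 5% threshold, for illustration only.
var_dump(applyErrorRateExceeds(3, 40, 5.0)); // 7.5% exceeds 5%: bool(true)
var_dump(applyErrorRateExceeds(1, 40, 5.0)); // 2.5% does not:   bool(false)
```

Treating an empty window as "no alert" is itself a design choice; a separate rule might be wanted for the case where submissions stop arriving altogether.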

§7 — Dashboards

[WS-7: Grafana / CloudWatch / similar. Panel layout, widget types, default time ranges. Skeleton to follow.]

Related documents

  • RFC-WS-6.md — WS-6 binding pipeline design (the failures observed and recorded by §3's classes originate here)
  • ARCH-BINDINGS.md — apply pipeline architecture
  • ARCH-FORM-BUILDER.md — form-builder runtime including webhooks