Monitoring and SLO

Use this page to define the minimum signals operators should watch after deployment.

What to monitor first

If you are building dashboards for the first time, start with:

  1. Health and request success rate.
  2. Write and recall latency.
  3. Policy selection success and decision linkage.
  4. Replay success on critical validation workflows.
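The first two signals can be computed directly from raw request samples. A minimal sketch, assuming you collect per-request HTTP status codes and latencies in milliseconds; the helper names are illustrative, not part of any product API:

```python
def success_rate(status_codes):
    """Fraction of requests that did not return a 5xx status."""
    if not status_codes:
        return 1.0  # no traffic is treated as healthy, not as failure
    ok = sum(1 for s in status_codes if s < 500)
    return ok / len(status_codes)

def p95(latencies_ms):
    """95th-percentile latency via the nearest-rank method on raw samples."""
    xs = sorted(latencies_ms)
    idx = max(0, int(round(0.95 * len(xs))) - 1)
    return xs[idx]
```

A success rate below your release threshold, or a p95 above the SLO target, is the first thing the dashboard should surface.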

Core metrics

  1. Request throughput and error rate.
  2. Write and recall p95 latency.
  3. Policy decision coverage and feedback coverage.
  4. Replay success ratio for validation runs.
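Metrics 3 and 4 are coverage-style ratios over recent traffic. A minimal sketch, assuming selection responses are dicts that carry a `decision_id` field and replay runs yield booleans; both shapes are assumptions for illustration:

```python
def decision_coverage(selections):
    """Fraction of selection responses that carry a decision_id link."""
    if not selections:
        return 1.0
    linked = sum(1 for s in selections if s.get("decision_id"))
    return linked / len(selections)

def replay_success_ratio(results):
    """Fraction of validation replay runs that passed."""
    return sum(results) / len(results) if results else 1.0
```

Trend both ratios over time; an absolute value matters less than a sudden drop after a release.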

SLO starting point

Use a simple first SLO model:

  1. Availability of the public API surface.
  2. Latency for write and recall on the primary tenant or scope.
  3. Error budget burn during deployment windows.
  4. Replay success on a small set of known-good workflows.
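Error budget burn (item 3) is conventionally tracked as a burn rate: the observed error ratio divided by the ratio the SLO allows. A sketch, assuming a ratio-based availability SLO:

```python
def burn_rate(error_ratio, slo_target):
    """Multiple of the error budget currently being consumed.

    A burn rate above 1.0 means the budget will be exhausted
    before the SLO window ends if the trend holds.
    """
    budget = 1.0 - slo_target
    return error_ratio / budget

# a 99.9% SLO with 0.5% observed errors burns the budget at 5x
```

During deployment windows, alert on a high burn rate over a short window rather than on raw error count, so small tenants and large tenants are judged by the same standard.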

Alert recommendations

  1. Sustained 5xx spike above release threshold.
  2. Recall latency regression beyond SLO target.
  3. Missing decision linkage in policy-critical flows.
  4. Replay failure bursts after deployment.
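"Sustained" matters for the first alert: a single bad scrape should not page anyone. A minimal sketch of a consecutive-breach check, with the threshold and window sizes left as deployment-specific assumptions:

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if `samples` ends with at least `min_consecutive`
    values above `threshold` -- a crude sustained-spike check."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
    return run >= min_consecutive
```

The same check applies to recall latency regression: feed it p95 samples and the SLO target instead of 5xx ratios.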

Red flags that deserve immediate investigation

  1. write succeeds but recall_text quality drops sharply.
  2. tools/select responses lose decision_id linkage.
  3. Replay failures spike right after a deployment or rule change.
  4. One tenant or scope degrades while global health remains green.
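The last red flag is easy to miss because global dashboards average it away. A sketch of a per-tenant check, assuming error and request counts are available per tenant (the input shape is an assumption for illustration):

```python
def degraded_tenants(per_tenant, threshold=0.05):
    """Tenants breaching the error-rate threshold while the global
    aggregate still looks green (below the same threshold).

    per_tenant maps tenant -> (errors, requests).
    """
    total_err = sum(e for e, _ in per_tenant.values())
    total_req = sum(r for _, r in per_tenant.values())
    global_ratio = total_err / total_req if total_req else 0.0
    if global_ratio > threshold:
        return []  # global alerting already fires; nothing is hidden
    return [t for t, (e, r) in per_tenant.items()
            if r and e / r > threshold]
```

Run this per scope as well as per tenant, since the same masking effect applies to any partition of traffic.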

Practical review cadence

  1. Daily: health + latency + error trends.
  2. Weekly: SLO trend + replay quality + governance drift.