Monitoring and SLO
Use this page to define the minimum signals operators should watch after deployment.
What to monitor first
If you are building dashboards for the first time, start with these signals (an instrumentation sketch follows the list):
- Health and request success rate
- Write and recall latency
- Policy selection success and decision linkage
- Replay success on critical validation workflows
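If your service exposes Prometheus-style metrics, a minimal instrumentation sketch for these four signals could look like the code below. The metric names, labels, port, and the `record_write` helper are illustrative assumptions, not part of any shipped schema.

```python
# Minimal sketch of first-dashboard signals using prometheus_client.
# Metric names, labels, and the port are illustrative assumptions.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "api_requests_total", "API requests by outcome", ["route", "outcome"]
)
WRITE_LATENCY = Histogram("write_latency_seconds", "Write path latency")
RECALL_LATENCY = Histogram("recall_latency_seconds", "Recall path latency")
REPLAYS = Counter(
    "replay_runs_total", "Replay validation runs by result", ["result"]
)

def record_write(duration_s: float, ok: bool) -> None:
    # Feeds both the success-rate panel and the write-latency panel.
    WRITE_LATENCY.observe(duration_s)
    REQUESTS.labels(route="write", outcome="ok" if ok else "error").inc()

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for the dashboard scraper
```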
Core metrics
- Request throughput and error rate.
- Write and recall p95 latency.
- Policy decision coverage and feedback coverage.
- Replay success ratio for validation runs.
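If you collect raw samples rather than pre-aggregated metrics, the core numbers reduce to a few small functions. A sketch; the `(duration, ok)` sample shapes are assumptions for illustration, not a real schema:

```python
import math

def p95(durations: list[float]) -> float:
    """95th-percentile latency, nearest-rank method on sorted samples."""
    if not durations:
        return 0.0
    ranked = sorted(durations)
    return ranked[math.ceil(0.95 * len(ranked)) - 1]

def error_rate(outcomes: list[bool]) -> float:
    """Fraction of failed requests over the window."""
    return (1 - sum(outcomes) / len(outcomes)) if outcomes else 0.0

def replay_success_ratio(passed: int, total: int) -> float:
    """Share of validation replays that succeeded; 1.0 when none ran."""
    return passed / total if total else 1.0
```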
SLO starting point
Use a simple first SLO model (a burn-rate sketch follows the list):
- Availability of the public API surface
- Latency for write and recall on the primary tenant or scope
- Error budget burn during deployment windows
- Replay success on a small set of known-good workflows
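To make "error budget burn" concrete: given an availability target, the budget is the allowed error fraction, and the burn rate is how many multiples of that allowance a window is consuming. A sketch, assuming an illustrative 99.9% target:

```python
# Sketch of error-budget burn for an assumed 99.9% availability SLO.
SLO_TARGET = 0.999        # illustrative target, not a recommendation
BUDGET = 1 - SLO_TARGET   # allowed error fraction

def burn_rate(errors: int, total: int) -> float:
    """Multiples of the allowed error rate consumed in this window.
    1.0 means exactly on budget; >1.0 means the budget is shrinking."""
    if total == 0:
        return 0.0
    return (errors / total) / BUDGET

# Reference point: a burn rate of 14.4 sustained for 1 hour consumes
# about 2% of a 30-day budget, a common fast-burn paging threshold.
```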
Alert recommendations
- Sustained 5xx spike above release threshold.
- Recall latency regression beyond SLO target.
- Missing decision linkage in policy-critical flows.
- Replay failure bursts after deployment.
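All four conditions can be evaluated mechanically over a rolling window. A sketch; the thresholds, window shape, and field names are assumptions to tune against your own traffic:

```python
from dataclasses import dataclass

# Sketch: checking the recommended alert conditions over one rolling
# window. All thresholds and field names are illustrative assumptions.

@dataclass
class Window:
    total: int               # requests observed in the window
    server_errors: int       # 5xx responses in the window
    recall_p95_s: float      # recall p95 latency for the window
    unlinked_decisions: int  # policy decisions missing linkage
    replay_failures: int     # failed validation replays

def violations(w: Window,
               error_threshold: float = 0.02,  # assumed release threshold
               recall_slo_s: float = 0.250,    # assumed recall p95 target
               replay_burst: int = 3) -> list[str]:
    """Return the alert conditions this window currently violates."""
    out = []
    if w.total and w.server_errors / w.total > error_threshold:
        out.append("sustained 5xx above release threshold")
    if w.recall_p95_s > recall_slo_s:
        out.append("recall latency beyond SLO target")
    if w.unlinked_decisions > 0:
        out.append("missing decision linkage in policy-critical flow")
    if w.replay_failures >= replay_burst:
        out.append("replay failure burst")
    return out
```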
Red flags that deserve immediate investigation
- `write` succeeds but `recall_text` quality drops sharply.
- `tools/select` responses lose `decision_id` linkage.
- Replay failures spike right after a deployment or rule change.
- One tenant or scope degrades while global health remains green.
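The last red flag is easy to miss because global aggregates hide it. One way to surface it is to compare each tenant's or scope's error rate against the global rate; a sketch, where the 3x outlier ratio and 1% floor are assumptions:

```python
# Sketch: flag scopes whose error rate is an outlier versus the global
# rate, so a single degraded tenant can't hide behind green aggregates.
# The 3x ratio and 1% floor are assumptions to tune on real traffic.

def degraded_scopes(per_scope: dict[str, tuple[int, int]],
                    ratio: float = 3.0) -> list[str]:
    """per_scope maps scope -> (errors, total); returns outlier scopes."""
    global_errors = sum(e for e, _ in per_scope.values())
    global_total = sum(t for _, t in per_scope.values())
    global_rate = global_errors / global_total if global_total else 0.0
    flagged = []
    for scope, (errors, total) in per_scope.items():
        rate = errors / total if total else 0.0
        if total and rate > max(ratio * global_rate, 0.01):
            flagged.append(scope)
    return flagged
```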
Practical review cadence
- Daily: health + latency + error trends.
- Weekly: SLO trend + replay quality + governance drift.