Operations

This section is the entry point for production operations.

What you will learn

How to run go-live checks safely.
How to monitor critical production signals.
How to use replay in incident response and verification.

What this section covers

Go-live readiness checks.
Monitoring and SLO guardrails.
Incident response with replay workflow.
Repeatable operational runbooks.
Release-safe rollback and evidence practices.

Task-driven operating paths

Go live this week: start with Go-live Checklist, then continue with Monitoring and SLO.
Investigate an incident now: start with Incident Response and Replay, then Runbooks.
Build evidence for change approval: use Tutorial: Release Gate with Replay Evidence.

Operating model

flowchart LR
  A["Staging validation"] --> B["Core gate"] --> C["Go-live decision"] --> D["Production traffic"] --> E["Monitoring + replay checks"]
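
One way to read the flow above is as an ordered sequence of gates, where work only advances while every earlier stage has passed. The sketch below illustrates that reading. The stage names come from the diagram; `Stage`, `advance`, and the pass/fail callables are hypothetical placeholders and not part of any tooling described in this section.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stage:
    name: str
    passed: Callable[[], bool]  # hypothetical check; replace with a real gate


def advance(stages: List[Stage]) -> Optional[str]:
    """Walk stages in order; return the first failing stage name, or None."""
    for stage in stages:
        if not stage.passed():
            return stage.name
    return None


# Stage names mirror the diagram; the lambdas are stand-ins for real checks.
pipeline = [
    Stage("Staging validation", lambda: True),
    Stage("Core gate", lambda: True),
    Stage("Go-live decision", lambda: False),
    Stage("Production traffic", lambda: True),
    Stage("Monitoring + replay checks", lambda: True),
]

blocked_at = advance(pipeline)
print(blocked_at or "all stages passed")
```

Stopping at the first failing stage keeps a failed gate from being masked by later stages that happen to pass.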

Daily operator checklist

Verify health status and key service metrics.
Run policy sanity checks on a target scope.
Validate one replay chain from recent traffic.
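
The daily checks can be scripted so that every item runs even if an earlier one fails, and the operator gets a single summary. This is a minimal sketch under stated assumptions: `check_health`, `check_policy`, and `check_replay` are hypothetical stubs, and the scope value is invented for illustration. Wire them to whatever health, policy, and replay tooling your deployment actually provides.

```python
from typing import Callable, Dict


def check_health() -> bool:
    # Hypothetical stub: verify health status and key service metrics.
    return True


def check_policy(scope: str) -> bool:
    # Hypothetical stub: run policy sanity checks on the given scope.
    return bool(scope)


def check_replay() -> bool:
    # Hypothetical stub: validate one replay chain from recent traffic.
    return True


def run_daily_checklist(scope: str) -> Dict[str, bool]:
    """Run every check; an exception counts as a failure and does not stop the run."""
    checks: Dict[str, Callable[[], bool]] = {
        "health": check_health,
        "policy": lambda: check_policy(scope),
        "replay": check_replay,
    }
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


results = run_daily_checklist("example-scope")
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'FAILED'}")
```

Unlike a release gate, the checklist deliberately runs all items instead of stopping at the first failure, so one broken check does not hide the state of the others.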

Weekly operator checklist

Run governance and evidence workflows.
Review SLO trends and incident noise.
Confirm rollback and drill readiness.

How to navigate this section

Use the left sidebar to move between checklists, monitoring, incident response, and runbooks.

Recommended reading order