Incident Response and Replay

Use replay as the backbone of incident debugging. The goal is not only to restore service, but to preserve enough evidence to explain what happened and verify the fix.

First 15 minutes

Identify the failing request_id and affected tenant or scope.
Pull run_id, decision_id, and commit_uri from logs if available.
Confirm whether the failure is isolated or release-wide.
Decide whether mitigation or rollback is required before deeper analysis.
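
Pulling the IDs out of a log line can be sketched as below. The space-separated key=value log format, the sample values, and the `extract_field` helper are assumptions for illustration only; adapt the parsing to whatever your logs actually emit.

```bash
# Hypothetical helper: extract_field KEY LINE
# Prints the value of KEY from a space-separated key=value log line,
# or nothing if the key is absent. The log format is an assumption.
extract_field() {
  printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Sample line with placeholder values (not real IDs).
log_line='level=error request_id=req_123 run_id=run_456 decision_id=dec_789 commit_uri=example://commit/abc'

for key in request_id run_id decision_id commit_uri; do
  printf '%s=%s\n' "$key" "$(extract_field "$key" "$log_line")"
done
```

A key that is missing from the line yields an empty value, which is a useful early signal that part of the chain was never logged.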

Incident workflow

1. Capture the failing request_id from client and server logs.
2. Resolve the chain of run_id, decision_id, and commit_uri.
3. Reconstruct the execution path with the replay APIs.
4. Apply the mitigation.
5. Re-run the replay to verify the fix.

Evidence package

Timeline of key events.
Request/decision payload snapshots.
Replay before/after comparison.
Mitigation and rollback notes.
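
One way to keep the package consistent across incidents is to scaffold it up front. The sketch below assumes one directory per request_id; the `init_evidence_dir` helper and every file name in it are hypothetical, chosen only to mirror the four items listed above.

```bash
# Hypothetical helper: init_evidence_dir ROOT REQUEST_ID
# Creates an empty evidence package skeleton and prints its path.
# touch is used so that re-running never truncates existing evidence.
init_evidence_dir() {
  dir="$1/$2"
  mkdir -p "$dir/payloads" "$dir/replay" || return 1
  touch "$dir/timeline.txt"          # timeline of key events
  touch "$dir/replay/before.json"    # replay output before the fix
  touch "$dir/replay/after.json"     # replay output after the fix
  touch "$dir/mitigation-notes.txt"  # mitigation and rollback notes
  printf '%s\n' "$dir"
}

# Example: scaffold a package under a temporary root for a placeholder ID.
root="$(mktemp -d)"
init_evidence_dir "$root" req_123
```

Request and decision payload snapshots would go under `payloads/`, named by the ID they belong to.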

What good incident evidence looks like

A reproducible failure path tied to concrete IDs
The exact decision or memory object that influenced the bad outcome
A fix verification replay showing the expected behavior
A clear record of whether the issue came from code, policy, or runtime context

Example: replay-based fix validation

Scenario:

A support workflow selected email_sender instead of ticket_router after a rule update.

Validation steps:

1. Pull the original IDs: request_id, run_id, decision_id, commit_uri.
2. Replay the original chain and confirm that the wrong selection (email_sender) is reproducible.
3. Apply the rule fix.
4. Replay the same workflow and verify that the selected tool is now ticket_router.

Expected comparison:

| Check | Before fix | After fix |
| --- | --- | --- |
| selected tool | `email_sender` | `ticket_router` |
| decision_id | old chain | new chain |
| status | failed expectation | pass |
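
The comparison above can be turned into a small pass/fail check. This is a minimal sketch: `verify_fix` is a hypothetical helper, and it assumes you have already read the selected tool out of each replay result by whatever means your replay output allows.

```bash
# Hypothetical helper: verify_fix EXPECTED BEFORE AFTER
# Returns 0 only if the original replay reproduced the wrong selection
# (BEFORE differs from EXPECTED) and the post-fix replay selects EXPECTED.
# Returns 2 if the failure was not reproduced, 1 if the fix did not take.
verify_fix() {
  expected="$1"; before="$2"; after="$3"
  if [ "$before" = "$expected" ]; then
    echo "not reproduced: original replay already selected $expected"
    return 2
  fi
  if [ "$after" != "$expected" ]; then
    echo "fail: post-fix replay selected $after, expected $expected"
    return 1
  fi
  echo "pass: selection changed from $before to $after"
}

# Values from the scenario above.
verify_fix ticket_router email_sender ticket_router
# prints "pass: selection changed from email_sender to ticket_router"
```

Treating "not reproduced" as its own outcome matters: a passing post-fix replay proves little if the original failure never reproduced in the first place.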

Fast command

```bash
curl -sS "$BASE_URL/v1/memory/resolve" \
  -H 'content-type: application/json' \
  -d "{\"tenant_id\":\"default\",\"scope\":\"default\",\"uri\":\"$COMMIT_URI\"}" | jq
```
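
During an incident it is easy to run the command with an unset variable or the wrong tenant. A guarded wrapper might look like the sketch below. The `resolve_commit` function and the `TENANT_ID` and `SCOPE` variables are assumptions introduced here; the endpoint and payload fields are the ones from the command above. It also assumes the values contain no double quotes or backslashes, since the JSON body is built by plain string interpolation.

```bash
# Hypothetical wrapper around the resolve call shown on this page.
# TENANT_ID and SCOPE are illustrative variables; both fall back to
# "default", matching the original command.
resolve_commit() {
  if [ -z "${BASE_URL:-}" ] || [ -z "${COMMIT_URI:-}" ]; then
    echo "BASE_URL and COMMIT_URI must be set" >&2
    return 64
  fi
  curl -sS "$BASE_URL/v1/memory/resolve" \
    -H 'content-type: application/json' \
    -d "{\"tenant_id\":\"${TENANT_ID:-default}\",\"scope\":\"${SCOPE:-default}\",\"uri\":\"$COMMIT_URI\"}"
}

# Usage (not executed here because it needs a live service):
#   TENANT_ID=acme resolve_commit | jq
```

Failing fast on missing variables avoids sending a request with an empty `uri`, which could otherwise be mistaken for a missing commit.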

Incident exit criteria

The mitigation or rollback is complete.
The replay chain has been reconstructed or explicitly ruled out as unavailable.
The likely root cause is documented with IDs and evidence.
A verification replay or regression check has passed.
