Incident Response and Replay
Use replay as the backbone of incident debugging. The goal is not only to restore service, but to preserve enough evidence to explain what happened and verify the fix.
First 15 minutes
- Identify the failing `request_id` and affected tenant or scope.
- Pull `run_id`, `decision_id`, and `commit_uri` from logs if available.
- Confirm whether the failure is isolated or release-wide.
- Decide whether mitigation or rollback is required before deeper analysis.
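As a rough illustration of pulling the IDs out of a log line, the sketch below assumes logs are emitted as space-separated `key=value` pairs. That format, the `extract_field` helper, and the sample values are all hypothetical; adapt the parsing to whatever your logs actually look like.

```bash
#!/usr/bin/env bash
# Hypothetical log format: space-separated key=value pairs on one line.
# extract_field KEY LINE prints the value of KEY, or nothing if KEY is absent.
extract_field() {
  local key="$1" line="$2"
  printf '%s\n' "$line" | tr ' ' '\n' | sed -n "s/^${key}=//p" | head -n 1
}

# Sample line with made-up IDs, for illustration only.
log_line='level=error request_id=req-123 run_id=run-456 decision_id=dec-789 commit_uri=example://commits/abc msg=tool_mismatch'

# Print each ID needed to start the replay chain.
for key in request_id run_id decision_id commit_uri; do
  printf '%s=%s\n' "$key" "$(extract_field "$key" "$log_line")"
done
```

If your logs are structured JSON, a `jq` filter would likely be a better fit than `sed`.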
Incident workflow
- Capture failing `request_id` from client and server logs.
- Resolve `run_id`, `decision_id`, and `commit_uri` chain.
- Reconstruct execution path with replay APIs.
- Apply mitigation.
- Re-run replay to verify fix.
Evidence package
- Timeline of key events.
- Request/decision payload snapshots.
- Replay before/after comparison.
- Mitigation and rollback notes.
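One possible way to keep these four items together is a per-incident directory keyed by `request_id`. The layout, the file names, and the `init_evidence_package` helper below are assumptions made for illustration, not a required format.

```bash
#!/usr/bin/env bash
# Sketch: scaffold an evidence package for one incident.
# Directory layout and file names are hypothetical.
init_evidence_package() {
  local root="$1" request_id="$2"
  local dir="$root/$request_id"
  mkdir -p "$dir/payloads" "$dir/replay"
  touch "$dir/timeline.md"          # timeline of key events
  touch "$dir/replay/before.json"   # replay output before the fix
  touch "$dir/replay/after.json"    # replay output after the fix
  touch "$dir/mitigation.md"        # mitigation and rollback notes
  printf '%s\n' "$dir"              # print the package path for the caller
}

# Example: create a package under a temporary directory.
pkg="$(init_evidence_package "$(mktemp -d)" "req-123")"
ls -R "$pkg"
```

`touch` is used rather than truncation so that re-running the helper does not wipe notes already written. Request and decision payload snapshots would go under `payloads/`.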
What good incident evidence looks like
- A reproducible failure path tied to concrete IDs
- The exact decision or memory object that influenced the bad outcome
- A fix verification replay showing the expected behavior
- A clear record of whether the issue came from code, policy, or runtime context
Example: replay-based fix validation
Scenario:
- A support workflow selected `email_sender` instead of `ticket_router` after a rule update.
Validation steps:
- Pull original IDs: `request_id`, `run_id`, `decision_id`, `commit_uri`.
- Replay original chain and confirm wrong selection is reproducible.
- Apply rule fix.
- Replay same workflow and verify selected tool changed to expected target.
Expected comparison:
| Check | Before fix | After fix |
|---|---|---|
| selected tool | email_sender | ticket_router |
| decision_id | old chain | new chain |
| status | failed expectation | pass |
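The selected-tool row of this comparison can be checked mechanically. The sketch below assumes you have already extracted the selected tool name from the before and after replay outputs; how to extract it depends on the replay response shape, which is not shown here. The `check_fix` helper and its return codes are hypothetical.

```bash
#!/usr/bin/env bash
# check_fix EXPECTED BEFORE AFTER
# Succeeds only when the original replay reproduces the wrong selection
# and the post-fix replay selects the expected tool.
check_fix() {
  local expected="$1" before="$2" after="$3"
  if [ "$before" = "$expected" ]; then
    # The failure was not reproduced, so the fix cannot be validated.
    echo "not reproduced: original replay already selected $expected"
    return 2
  fi
  if [ "$after" != "$expected" ]; then
    echo "fail: after fix selected $after, expected $expected"
    return 1
  fi
  echo "pass: $before -> $after"
}

# Values from the scenario above.
check_fix ticket_router email_sender ticket_router
# prints "pass: email_sender -> ticket_router"
```

Treating "not reproduced" as a distinct outcome matters: a passing after-fix replay proves little if the original failure never reproduced in the first place.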
Fast command
```bash
curl -sS "$BASE_URL/v1/memory/resolve" \
  -H 'content-type: application/json' \
  -d "{\"tenant_id\":\"default\",\"scope\":\"default\",\"uri\":\"$COMMIT_URI\"}" | jq
```
Incident exit criteria
- The mitigation or rollback is complete.
- The replay chain has been reconstructed or explicitly ruled out as unavailable.
- The likely root cause is documented with IDs and evidence.
- A verification replay or regression check has passed.
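If you want to gate incident closure on these criteria, a small checklist function is one option. The sketch below is illustrative only: the `exit_criteria_met` helper, the criterion names, and the `name=done` / `name=pending` convention are all assumptions.

```bash
#!/usr/bin/env bash
# exit_criteria_met NAME=STATE...
# Each argument is "name=done" or "name=pending" (hypothetical convention).
# Prints every criterion that is not done and returns non-zero if any remain.
exit_criteria_met() {
  local pending=0 item
  for item in "$@"; do
    case "$item" in
      *=done) ;;                                   # criterion satisfied
      *) echo "pending: ${item%%=*}"; pending=1 ;; # anything else blocks closure
    esac
  done
  return "$pending"
}

# Example: root cause is not yet documented, so the incident stays open.
if exit_criteria_met mitigation=done replay_chain=done root_cause=pending verification=done; then
  echo "incident can be closed"
else
  echo "incident stays open"
fi
```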