Tutorial: Incident Replay for a Production Failure
Use Aionis replay primitives to reconstruct and verify a failed workflow.
Before you start
- You have observed a failed or suspicious run in logs.
- You can access run_id or decision_id.
- You know the tenant_id and scope for the incident.
What you will finish with
A replay evidence chain you can attach to incident review and use to verify the fix.
Tip: Copy and run. Use the copy button on each code block. For consistent environment variables, run the One-click Environment Template.
Input
Input fields
| Field | Required | Used in steps | Example |
|---|---|---|---|
| BASE_URL | Yes | 2, 3 | http://localhost:3001 |
| AIONIS_API_KEY | Yes | 2, 3 | aionis_live_xxx |
| tenant_id | Yes | 2, 3 | default |
| scope | Yes | 2, 3 | support |
| run_id | Yes | 2 | 3d1868e2-e6d3-4f69-952e-61f53ef2ef30 |
| decision_id | Conditional | 3 | 8fe92f61-... |
| commit_id | Conditional | 3 | commit_xxx |
Output fields to persist
| Field | Source step | Why keep it |
|---|---|---|
| request_id | 2, 3 | Incident timeline correlation |
| run_id | 2 | Replay run anchor |
| timeline[] | 2 | Step-by-step incident evidence |
| steps[] | 2 | Failure boundary and postconditions |
| artifacts[] | 2 | Forensics and audit attachment |
| resolved payload | 3 | Decision/commit root-cause verification |
Steps
Step 1: Collect incident identifiers
From app and gateway logs, collect:
- request_id
- run_id
- decision_id
- commit_uri
- tenant_id
- scope
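One way to gather these identifiers is to scan structured logs. The sketch below assumes the logs are JSON lines whose records carry the fields under the same names as the list above; the function name `collect_incident_ids` and the log layout are hypothetical, so adapt them to your actual log format.

```python
import json

# Identifier names from the list above; the JSON-lines log layout is an assumption.
INCIDENT_KEYS = ("request_id", "run_id", "decision_id", "commit_uri", "tenant_id", "scope")


def collect_incident_ids(log_lines):
    """Return the first value seen for each incident identifier."""
    found = {}
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        for key in INCIDENT_KEYS:
            if key in record and key not in found:
                found[key] = record[key]
    return found


# Illustrative log lines, not real Aionis output.
sample_logs = [
    '{"level": "error", "request_id": "req_replay_123", "tenant_id": "default", "scope": "support"}',
    "plain text line that is not JSON",
    '{"run_id": "3d1868e2-e6d3-4f69-952e-61f53ef2ef30", "tenant_id": "default"}',
]
ids = collect_incident_ids(sample_logs)
missing = [key for key in INCIDENT_KEYS if key not in ids]
print(ids)
print("missing:", missing)
```

Listing the missing identifiers up front shows whether Step 3 can run, since it needs decision_id or commit_id.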
Step 2: Pull replay timeline
TypeScript
ts
const replayRes = await fetch(`${process.env.BASE_URL}/v1/memory/replay/runs/get`, {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    'x-api-key': process.env.AIONIS_API_KEY!
  },
  body: JSON.stringify({
    tenant_id: 'default',
    scope: 'support',
    run_id: '<run_uuid>',
    include_steps: true,
    include_artifacts: true
  })
})
const replay = await replayRes.json()
console.log(replay)
Python
python
import os
import requests

replay = requests.post(
    f"{os.environ['BASE_URL']}/v1/memory/replay/runs/get",
    headers={"content-type": "application/json", "X-Api-Key": os.environ["AIONIS_API_KEY"]},
    json={
        "tenant_id": "default",
        "scope": "support",
        "run_id": "<run_uuid>",
        "include_steps": True,
        "include_artifacts": True,
    },
    timeout=30,
)
print(replay.json())
cURL
bash
curl -sS "$BASE_URL/v1/memory/replay/runs/get" \
  -H "X-Api-Key: $AIONIS_API_KEY" \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id":"default",
    "scope":"support",
    "run_id":"<run_uuid>",
    "include_steps":true,
    "include_artifacts":true
  }' | jq
Review:
- failed step index
- tool input/output signatures
- postcondition status
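Locating the failed step index can be automated once the replay response is parsed. The sketch below assumes each timeline entry carries step_index and status, as in the expected response sample in this tutorial. Treating any status other than "success" as the failure boundary is an assumption, and `first_failed_step` is a hypothetical helper name.

```python
def first_failed_step(replay):
    """Return the first timeline entry whose status is not "success", or None."""
    for entry in sorted(replay.get("timeline", []), key=lambda e: e["step_index"]):
        if entry.get("status") != "success":
            return entry
    return None


# Trimmed copy of the expected response sample from this tutorial.
replay = {
    "status": "ok",
    "run": {"run_id": "3d1868e2-e6d3-4f69-952e-61f53ef2ef30", "status": "failed"},
    "timeline": [
        {"step_index": 1, "status": "success"},
        {"step_index": 2, "status": "failed"},
    ],
}
failed = first_failed_step(replay)
print(failed)
```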
Step 3: Resolve decision and commit evidence
bash
curl -sS "$BASE_URL/v1/memory/resolve" \
  -H "X-Api-Key: $AIONIS_API_KEY" \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id":"default",
    "scope":"support",
    "uri":"aionis://default/support/decision/<decision_id>",
    "include_meta":true
  }' | jq

curl -sS "$BASE_URL/v1/memory/resolve" \
  -H "X-Api-Key: $AIONIS_API_KEY" \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id":"default",
    "scope":"support",
    "uri":"memory://commit/<commit_id>",
    "include_meta":true
  }' | jq
Step 4: Confirm root-cause hypothesis
Use an evidence table in incident notes:
- expected decision or tool
- actual selected tool and rule sources
- context mismatch or stale memory cause
- remediation patch and owner
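The four items above map naturally onto one table row per hypothesis. The following is a minimal sketch of rendering such a table for incident notes; the `EvidenceRow` class, the column titles, and the placeholder values are all illustrative assumptions, not part of Aionis.

```python
from dataclasses import dataclass


@dataclass
class EvidenceRow:
    expected: str      # expected decision or tool
    actual: str        # actual selected tool and rule sources
    cause: str         # context mismatch or stale memory cause
    remediation: str   # remediation patch and owner


def render_evidence_table(rows):
    """Render evidence rows as a pipe-delimited table for incident notes."""
    lines = [
        "| Expected | Actual | Cause | Remediation |",
        "|---|---|---|---|",
    ]
    for row in rows:
        lines.append(f"| {row.expected} | {row.actual} | {row.cause} | {row.remediation} |")
    return "\n".join(lines)


# Placeholder values only; fill these in from the replay and resolve output.
rows = [EvidenceRow("<expected tool>", "<actual tool>", "<cause>", "<patch and owner>")]
table = render_evidence_table(rows)
print(table)
```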
Step 5: Verify fix before close
After patching rules or integration logic, rerun the same business case and compare:
- selection outcome
- replay timeline status
- artifacts and postconditions
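The timeline comparison can be sketched as a per-step diff between the failing replay and the post-fix replay. This assumes both responses expose timeline entries with step_index and status, as in the expected response sample; `diff_timelines` is a hypothetical helper, and the post-fix data below is made up for illustration.

```python
def diff_timelines(before, after):
    """Return (step_index, old_status, new_status) for steps whose status changed."""
    old = {e["step_index"]: e["status"] for e in before.get("timeline", [])}
    new = {e["step_index"]: e["status"] for e in after.get("timeline", [])}
    changes = []
    for idx in sorted(set(old) | set(new)):
        if old.get(idx) != new.get(idx):
            changes.append((idx, old.get(idx), new.get(idx)))
    return changes


# Timeline from the failing run (matches the expected response sample).
before = {"timeline": [{"step_index": 1, "status": "success"}, {"step_index": 2, "status": "failed"}]}
# Hypothetical post-fix replay; real values come from rerunning the business case.
after = {"timeline": [{"step_index": 1, "status": "success"}, {"step_index": 2, "status": "success"}]}
changes = diff_timelines(before, after)
print(changes)
```

A step present in only one of the two runs appears with None on the missing side, which flags a changed execution path rather than a changed status.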
Expected response sample
json
{
  "status": "ok",
  "request_id": "req_replay_123",
  "tenant_id": "default",
  "scope": "support",
  "run": {
    "run_id": "3d1868e2-e6d3-4f69-952e-61f53ef2ef30",
    "status": "failed"
  },
  "timeline": [
    { "step_index": 1, "status": "success" },
    { "step_index": 2, "status": "failed" }
  ]
}
Common failure and fix
Failure:
json
{"error":"not_found","message":"replay run or playbook not found"}
Fix:
- Confirm tenant_id and scope match the original failing run.
- Validate run_id format and value from source logs.
- If run record is missing, resolve decision_id first and rebuild chain from decision_uri.
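The run_id format check can be done locally before retrying the request. The sketch below assumes run_id is a canonical hyphenated UUID, which is inferred from the example value in the input table rather than from a documented format; `is_valid_run_id` is a hypothetical helper.

```python
import uuid


def is_valid_run_id(value):
    """Return True if value parses as a canonical hyphenated UUID."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(is_valid_run_id("3d1868e2-e6d3-4f69-952e-61f53ef2ef30"))
print(is_valid_run_id("<run_uuid>"))  # unreplaced placeholder from the examples
```

This catches a placeholder that was never substituted or a value truncated when copied from logs, but it cannot confirm that the run exists for the given tenant_id and scope.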
Success criteria
- replay/runs/get returns the targeted run_id with timeline data.
- A failing or divergent step can be clearly identified.
- resolve returns decision/commit evidence tied to the same incident scope.
- Post-fix rerun shows expected status change on the critical step.
Incident close checklist
- Root cause linked to concrete IDs.
- Corrective change merged.
- Replay evidence attached.
- Follow-up monitor or gate added.