Tutorial: Incident Replay for a Production Failure

Use Aionis replay primitives to reconstruct and verify a failed workflow.

Before you start

You have observed a failed or suspicious run in logs.
You can access run_id or decision_id.
You know the tenant_id and scope for the incident.

What you will finish with

A replay evidence chain you can attach to incident review and use to verify the fix.

Tip - Copy and run Use the copy button on each code block. For consistent environment variables, run One-click Environment Template.

Input

Input fields

Field	Required	Used in steps	Example
`BASE_URL`	Yes	2, 3	`http://localhost:3001`
`AIONIS_API_KEY`	Yes	2, 3	`aionis_live_xxx`
`tenant_id`	Yes	2, 3	`default`
`scope`	Yes	2, 3	`support`
`run_id`	Yes	2	`3d1868e2-e6d3-4f69-952e-61f53ef2ef30`
`decision_id`	Conditional	3	`8fe92f61-...`
`commit_id`	Conditional	3	`commit_xxx`

Output fields to persist

Field	Source step	Why keep it
`request_id`	2, 3	Incident timeline correlation
`run_id`	2	Replay run anchor
`timeline[]`	2	Step-by-step incident evidence
`steps[]`	2	Failure boundary and postconditions
`artifacts[]`	2	Forensics and audit attachment
`resolved` payload	3	Decision/commit root-cause verification

Steps

Step 1: Collect incident identifiers

From app and gateway logs, collect:

request_id
run_id
decision_id
commit_uri
tenant_id
scope

Step 2: Pull replay timeline

TypeScript

const replayRes = await fetch(`${process.env.BASE_URL}/v1/memory/replay/runs/get`, {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    'x-api-key': process.env.AIONIS_API_KEY!
  },
  body: JSON.stringify({
    tenant_id: 'default',
    scope: 'support',
    run_id: '<run_uuid>',
    include_steps: true,
    include_artifacts: true
  })
})

const replay = await replayRes.json()
console.log(replay)

Python

python

import os
import requests

replay = requests.post(
    f"{os.environ['BASE_URL']}/v1/memory/replay/runs/get",
    headers={"content-type": "application/json", "X-Api-Key": os.environ["AIONIS_API_KEY"]},
    json={
        "tenant_id": "default",
        "scope": "support",
        "run_id": "<run_uuid>",
        "include_steps": True,
        "include_artifacts": True,
    },
    timeout=30,
)

print(replay.json())

cURL

bash

curl -sS "$BASE_URL/v1/memory/replay/runs/get" \
  -H "X-Api-Key: $AIONIS_API_KEY" \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id":"default",
    "scope":"support",
    "run_id":"<run_uuid>",
    "include_steps":true,
    "include_artifacts":true
  }' | jq

Review:

failed step index
tool input/output signatures
postcondition status

Step 3: Resolve decision and commit evidence

bash

curl -sS "$BASE_URL/v1/memory/resolve" \
  -H "X-Api-Key: $AIONIS_API_KEY" \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id":"default",
    "scope":"support",
    "uri":"aionis://default/support/decision/<decision_id>",
    "include_meta":true
  }' | jq

curl -sS "$BASE_URL/v1/memory/resolve" \
  -H "X-Api-Key: $AIONIS_API_KEY" \
  -H 'content-type: application/json' \
  -d '{
    "tenant_id":"default",
    "scope":"support",
    "uri":"memory://commit/<commit_id>",
    "include_meta":true
  }' | jq

Step 4: Confirm root-cause hypothesis

Use an evidence table in incident notes:

expected decision or tool
actual selected tool and rule sources
context mismatch or stale memory cause
remediation patch and owner

Step 5: Verify fix before close

After patching rules or integration logic, rerun the same business case and compare:

selection outcome
replay timeline status
artifacts and postconditions

Expected response sample

json

{
  "status": "ok",
  "request_id": "req_replay_123",
  "tenant_id": "default",
  "scope": "support",
  "run": {
    "run_id": "3d1868e2-e6d3-4f69-952e-61f53ef2ef30",
    "status": "failed"
  },
  "timeline": [
    { "step_index": 1, "status": "success" },
    { "step_index": 2, "status": "failed" }
  ]
}

Common failure and fix

Failure:

json

{"error":"not_found","message":"replay run or playbook not found"}

Fix:

Confirm tenant_id and scope match the original failing run.
Validate run_id format and value from source logs.
If run record is missing, resolve decision_id first and rebuild chain from decision_uri.

Success criteria

replay/runs/get returns the targeted run_id with timeline data.
A failing or divergent step can be clearly identified.
resolve returns decision/commit evidence tied to the same incident scope.
Post-fix rerun shows expected status change on the critical step.

Incident close checklist

Root cause linked to concrete IDs.
Corrective change merged.
Replay evidence attached.
Follow-up monitor or gate added.

Tutorial: Incident Replay for a Production Failure ​

Before you start ​

What you will finish with ​

Input ​

Input fields ​

Output fields to persist ​

Steps ​

Step 1: Collect incident identifiers ​

Step 2: Pull replay timeline ​

TypeScript ​

Python ​

cURL ​

Step 3: Resolve decision and commit evidence ​

Step 4: Confirm root-cause hypothesis ​

Step 5: Verify fix before close ​

Expected response sample ​

Common failure and fix ​

Success criteria ​

Incident close checklist ​