Operator Runbook
This runbook defines a practical cadence and thresholds for operating Aionis in production.
Daily
- Health gate (deployment and runtime guard):
cd /Users/lucio/Desktop/Aionis
npm run job:health-gate
Note: the health gate runs a pre-check embedding_model backfill by default to auto-heal historical READY rows.
- If you want warning-tight mode:
npm run job:health-gate -- --strict-warnings --consistency-check-set scope
Execution-loop aware mode (recommended for policy-heavy deployments):
npm run job:health-gate -- --strict-warnings --consistency-check-set scope --run-execution-loop-gate
Policy-adaptation aware mode (recommended before rule lifecycle changes):
npm run job:health-gate -- --strict-warnings --consistency-check-set scope --run-execution-loop-gate --run-policy-adaptation-gate
Run cross-tenant integrity as a separate gate (recommended at least daily, and always before schema/tenant releases):
npm run job:consistency-check:cross-tenant -- --strict-warnings
- Lane visibility quick check (rules evaluate):
set -a; source /Users/lucio/Desktop/Aionis/.env; set +a
curl -sS localhost:${PORT:-3001}/v1/memory/rules/evaluate \
-H 'content-type: application/json' \
-d '{
"context":{"intent":"json","provider":"minimax","tool":{"name":"psql"},"agent":{"id":"agent_a","team_id":"team_default"}},
"include_shadow":true,
"limit":50
}' \
| jq '{lane:.agent_visibility_summary.lane, scope_stats:.agent_visibility_summary.rule_scope}'
- Lane visibility quick check (tool selector):
set -a; source /Users/lucio/Desktop/Aionis/.env; set +a
curl -sS localhost:${PORT:-3001}/v1/memory/tools/select \
-H 'content-type: application/json' \
-d '{
"context":{"intent":"json","provider":"minimax","tool":{"name":"psql"},"agent":{"id":"agent_a","team_id":"team_default"}},
"candidates":["psql","curl","bash"],
"strict":true,
"include_shadow":true,
"rules_limit":50
}' \
| jq '{selected:.selection.selected, ordered:.selection.ordered, lane:.rules.agent_visibility_summary.lane, scope_stats:.rules.agent_visibility_summary.rule_scope}'
Expected (steady state):
  - lane.applied=true
  - lane.legacy_unowned_private_detected=0
  - scope_stats.filtered_by_lane should be stable (non-zero can be normal in multi-agent isolation)
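The steady-state expectations above can be turned into a pass/fail check. The following is a minimal sketch, not part of Aionis: check_lane_steady_state is a hypothetical helper name, and it assumes that jq is installed and that the reduced {lane, scope_stats} object printed by either quick check is piped into it. It does not judge scope_stats.filtered_by_lane, because "stable" depends on your own baseline.

```shell
#!/bin/sh
# Hypothetical helper (not an Aionis command).
# Reads the reduced {lane, scope_stats} JSON object on stdin and exits 0 only
# when lane.applied is true and lane.legacy_unowned_private_detected is 0.
check_lane_steady_state() {
  # jq -e maps a final result of false or null to a non-zero exit status.
  jq -e '.lane.applied == true and .lane.legacy_unowned_private_detected == 0' >/dev/null
}

# Usage: append the helper to either quick-check pipeline, for example
#   curl ... | jq '{lane: ..., scope_stats: ...}' | check_lane_steady_state \
#     || echo "lane visibility is not in steady state" >&2
```

A missing field compares as null, so an unexpected response shape fails the check rather than passing silently.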
- Recall default profile check:
set -a; source /Users/lucio/Desktop/Aionis/.env; set +a
echo "MEMORY_RECALL_PROFILE=${MEMORY_RECALL_PROFILE:-strict_edges}"
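# Keep the "{}" default in a helper variable: inside ${VAR:-{}} the shell ends the expansion at the first "}", so a set value would print with a stray trailing "}".
default_policy_json='{}'
echo "MEMORY_RECALL_PROFILE_POLICY_JSON=${MEMORY_RECALL_PROFILE_POLICY_JSON:-$default_policy_json}"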
echo "MEMORY_RECALL_ADAPTIVE_DOWNGRADE_ENABLED=${MEMORY_RECALL_ADAPTIVE_DOWNGRADE_ENABLED:-true}"
echo "MEMORY_RECALL_ADAPTIVE_WAIT_MS=${MEMORY_RECALL_ADAPTIVE_WAIT_MS:-200}"
echo "MEMORY_RECALL_ADAPTIVE_TARGET_PROFILE=${MEMORY_RECALL_ADAPTIVE_TARGET_PROFILE:-strict_edges}"
echo "MEMORY_RECALL_TEXT_CONTEXT_TOKEN_BUDGET_DEFAULT=${MEMORY_RECALL_TEXT_CONTEXT_TOKEN_BUDGET_DEFAULT:-0}"
- Throughput profile check/apply:
cd /Users/lucio/Desktop/Aionis
npm run -s env:throughput:prod
This updates only the managed throughput block in .env and keeps existing secrets unchanged.
- Optional context compaction smoke (recall_text):
curl -sS localhost:${PORT:-3001}/v1/memory/recall_text \
-H 'content-type: application/json' \
-d '{"query_text":"release policy","context_token_budget":600,"context_compaction_profile":"aggressive","return_debug":true}' \
| jq '{context_chars:(.context.text|length), items:(.context.items|length), citations:(.context.citations|length), compaction:.debug.context_compaction}'
Weekly
- Long-horizon drift snapshot:
cd /Users/lucio/Desktop/Aionis
npm run job:quality-eval -- --strict
- Integrity deep check:
npm run job:consistency-check:scope -- --scope default --strict-warnings
npm run job:consistency-check:cross-tenant -- --strict-warnings
npm run job:execution-loop-gate -- --scope default --strict-warnings
For large datasets where full scan runtime is too high, use fast mode (lower-bound counts) and batch by check index:
npm run job:consistency-check:scope:fast -- --scope default --strict-warnings
npm run job:consistency-check:scope -- --scope default --batch-size 10 --batch-index 0 --strict-warnings
npm run job:consistency-check:scope -- --scope default --batch-size 10 --batch-index 1 --strict-warnings
If private_rule_without_owner is non-zero:
npm run job:private-rule-owner-backfill -- --limit 5000
- Governance weekly snapshot (JSON + Markdown):
npm run -s job:governance-weekly-report -- --scope default --window-hours 168
For release gate:
npm run -s job:governance-weekly-report -- --scope default --window-hours 168 --strict-warnings
- Lifecycle smoke (API + jobs + feedback loop):
npm run e2e:phase4-smoke
- Tenant isolation smoke (Phase C):
npm run e2e:phasec-tenant
- Auxiliary benchmark regression (non-blocking):
npm run -s env:throughput:benchmark
npm run -s bench:longmemeval:gate
npm run -s bench:locomo -- --sample-limit 1 --qa-limit 20
Do not use benchmark failures above as release blockers. Treat them as auxiliary drift signals.
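One way to keep the auxiliary benchmarks non-blocking inside a scheduled script is to record their failures without propagating the exit status. This is a sketch under assumptions: run_nonblocking is a hypothetical wrapper, not an Aionis command, and the labels are arbitrary.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Aionis).
# run_nonblocking LABEL COMMAND [ARGS...]
# Runs COMMAND, reports a failure on stderr as a drift signal, and always
# returns 0 so the calling script continues.
run_nonblocking() {
  label=$1
  shift
  if "$@"; then
    echo "ok: $label"
  else
    status=$?
    echo "drift signal (non-blocking): $label exited with status $status" >&2
  fi
  return 0
}

# Usage with the benchmark commands listed above:
#   run_nonblocking longmemeval npm run -s bench:longmemeval:gate
#   run_nonblocking locomo npm run -s bench:locomo -- --sample-limit 1 --qa-limit 20
```

Blocking checks such as the consistency and execution-loop gates should stay outside this wrapper so that their failures still stop the run.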
Suggested Thresholds
Use these as default SLO-style boundaries. Tune by scope once traffic stabilizes.
- quality.metrics.embedding_ready_ratio >= 0.80
- quality.metrics.alias_rate <= 0.30
- quality.metrics.archive_ratio <= 0.95
- quality.metrics.fresh_30d_ratio >= 0.20
- consistency.summary.errors == 0 (always)
- consistency.summary.warnings == 0 (recommended for production gate)
- embedding_model_invalid_for_ready == 0 (no unknown:* model labels)
- tenant_scope_key_malformed == 0 and all cross_tenant_* == 0
- policy_adaptation.summary.urgent_disable_candidates == 0
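If you want to evaluate these boundaries in a script, a small comparison helper is enough. The sketch below is an assumption-laden illustration: check_threshold is a hypothetical function, the metric values are passed in by hand (extracting them from job output is left to you), and the sample numbers in the usage comments are made up.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aionis).
# check_threshold NAME VALUE OP BOUND
# OP is one of: ge (>=), le (<=), eq (==). Prints PASS or FAIL and returns
# 0 on pass, 1 on fail. awk performs the comparison so decimals work in sh.
check_threshold() {
  name=$1
  value=$2
  op=$3
  bound=$4
  if awk -v v="$value" -v b="$bound" -v op="$op" 'BEGIN {
        if (op == "ge") ok = (v + 0 >= b + 0)
        else if (op == "le") ok = (v + 0 <= b + 0)
        else if (op == "eq") ok = (v + 0 == b + 0)
        else ok = 0
        exit(ok ? 0 : 1)
      }'; then
    echo "PASS $name=$value ($op $bound)"
  else
    echo "FAIL $name=$value ($op $bound)"
    return 1
  fi
}

# Usage (sample values are illustrative only):
#   check_threshold quality.metrics.embedding_ready_ratio 0.86 ge 0.80
#   check_threshold quality.metrics.alias_rate 0.12 le 0.30
#   check_threshold consistency.summary.errors 0 eq 0
```

An unrecognized operator is treated as a failure, so a typo in the operator name cannot produce a false pass.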
Incident Playbook
- If quality_eval_failed:
  - Run npm run job:quality-eval and inspect summary.failed (or failed_checks in job:health-gate output).
  - Run npm run job:salience-decay and re-check.
  - If the failure is ready_ratio related, inspect the embedding backfill and outbox worker.
- If consistency errors appear:
  - Run npm run job:consistency-check:scope -- --scope default and inspect the failing check names.
  - If tenant integrity may be involved, run npm run job:consistency-check:cross-tenant.
  - Verify migrations are up to date: make db-migrate.
  - For outbox-related failures, run npm run job:outbox-worker -- --once and then replay failed items if needed.
- If archive/activation behavior regresses:
  - Run npm run e2e:phase4-smoke to reproduce end-to-end.
  - Validate last_rehydrated_* and feedback_* slot markers for the test node.
- If recall_text starts returning upstream_embedding_rate_limited / upstream_embedding_unavailable:
  - Verify provider quotas first.
  - Check whether query embedding cache is enabled:
    - RECALL_TEXT_EMBED_CACHE_ENABLED=true
    - RECALL_TEXT_EMBED_CACHE_TTL_MS / RECALL_TEXT_EMBED_CACHE_MAX_KEYS sized for traffic.
  - Temporarily reduce upstream pressure by lowering caller concurrency or increasing repeated-query cache hit ratio.
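The cache settings named in the last playbook item can be sanity-checked after loading .env. This is a rough sketch: check_embed_cache_env is a hypothetical helper, and it assumes the three variables are set explicitly in the environment and that the TTL and key-count values are plain integers. If your deployment relies on application defaults, an unset variable is not necessarily a problem.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aionis).
# Returns 0 when the query embedding cache looks configured, 1 otherwise.
check_embed_cache_env() {
  rc=0
  if [ "${RECALL_TEXT_EMBED_CACHE_ENABLED:-}" != "true" ]; then
    echo "RECALL_TEXT_EMBED_CACHE_ENABLED is not 'true'"
    rc=1
  fi
  for var in RECALL_TEXT_EMBED_CACHE_TTL_MS RECALL_TEXT_EMBED_CACHE_MAX_KEYS; do
    # Indirect lookup of the variable named in $var.
    eval "val=\${$var:-}"
    case $val in
      ''|*[!0-9]*)
        echo "$var is unset or not an integer: '$val'"
        rc=1
        ;;
    esac
  done
  return "$rc"
}

# Usage:
#   set -a; . ./.env; set +a
#   check_embed_cache_env || echo "review embedding cache settings" >&2
```

The check only confirms that values are present and well-formed. Whether they are sized for traffic still has to be judged against your observed query volume.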
Release Gate Recommendation
Before production deploy:
cd /Users/lucio/Desktop/Aionis
npm run -s gate:core:prod -- \
--base-url "http://localhost:${PORT:-3001}" \
--scope default \
--require-partition-ready true \
--partition-dual-write-enabled true \
--partition-read-shadow-check true \
--run-perf true \
--recall-p95-max-ms 1200 \
--write-p95-max-ms 800 \
--error-rate-max 0.02
Only deploy when the core gate passes. If you are pre-cutover or running rehearsal only, set --require-partition-ready false.
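The "only deploy when the core gate passes" rule can be enforced mechanically by chaining the deploy step behind the gate's exit status. A minimal sketch, assuming the gate command exits non-zero on failure; deploy_if_gate_passes and the deploy script name in the usage comment are hypothetical placeholders, not Aionis tooling.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aionis).
# deploy_if_gate_passes GATE_COMMAND DEPLOY_COMMAND
# Runs GATE_COMMAND; runs DEPLOY_COMMAND only if the gate exits 0.
deploy_if_gate_passes() {
  gate_cmd=$1
  deploy_cmd=$2
  if sh -c "$gate_cmd"; then
    sh -c "$deploy_cmd"
  else
    echo "core gate failed; deploy skipped" >&2
    return 1
  fi
}

# Usage (./deploy.sh is a placeholder for your own deploy step):
#   deploy_if_gate_passes \
#     'npm run -s gate:core:prod -- --base-url "http://localhost:3001" --scope default' \
#     './deploy.sh'
```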
Verification Stamp
- Last reviewed: 2026-02-18
- Verification commands:
  - npm run docs:check
  - npm run job:health-gate -- --strict-warnings --consistency-check-set scope
  - npm run job:consistency-check:cross-tenant -- --strict-warnings