Adaptive Compression Plan
This plan upgrades Aionis from a static compression rollup to a budget-driven adaptive compression engine while preserving auditability and stable production latency.
Why
Current state is solid but limited:
- Compression rollup exists and is non-destructive.
- `recall_text` already prefers summary-first rendering.
- Compression gain is material but modest on baseline docs (4699 -> 4244 context chars, roughly a 10% reduction).
Target state:
- Per-request token budget control.
- Adaptive context compaction under load/large context.
- Measured compression KPI in CI and release gate.
Scope
In scope:
- Open Core API and recall path.
- Context rendering and compaction policy.
- SDK input type alignment (TypeScript + Python).
- Operator/docs/runbook updates.
Out of scope:
- Hosted-only internal policy engine.
- Model-specific semantic summarization service.
- Breaking response schema changes.
Success Metrics
Primary:
- `context_compression_ratio` >= 0.40 on a representative long-memory workload.
- `answer_quality_retain` >= 0.95 versus the non-compressed baseline (task-specific eval set).
- `recall_text` p95 regression <= +10%.
Secondary:
- `rate_limited_recall_text_embed` does not regress in steady traffic.
- `consistency` checks remain clean for compression citations.
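A minimal sketch, in TypeScript, of how the primary KPIs could be computed from a benchmark run; the `CompressionRun` fields and the example scores are illustrative, not the actual benchmark schema.

```ts
// Hypothetical KPI computation over one benchmark run.
// Field names are illustrative; the real benchmark schema may differ.
interface CompressionRun {
  originalChars: number;    // context chars without compaction
  compressedChars: number;  // context chars with budget applied
  baselineScore: number;    // eval score on non-compressed context
  compressedScore: number;  // eval score on compressed context
}

function compressionKpis(run: CompressionRun) {
  const context_compression_ratio = 1 - run.compressedChars / run.originalChars;
  const answer_quality_retain = run.compressedScore / run.baselineScore;
  return { context_compression_ratio, answer_quality_retain };
}

// The 4699 -> 4244 baseline from this plan yields a ratio of ~0.097, well below
// the 0.40 target; the eval scores below are placeholder values.
console.log(compressionKpis({
  originalChars: 4699,
  compressedChars: 4244,
  baselineScore: 1.0,
  compressedScore: 0.97,
}));
```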
Rollout Phases
Phase 1 (Now): Budget-Driven Context Compaction
Status: completed
Deliverables:
- Add request knobs:
  - `context_token_budget?: number`
  - `context_char_budget?: number`
- Add server default: `MEMORY_RECALL_TEXT_CONTEXT_TOKEN_BUDGET_DEFAULT` (0 disables).
- Apply compaction only to `context.text`, preserving `items` and `citations` (see the sketch after this list).
- Add observability fields to recall logs:
  - context length, estimated tokens, budget, compaction applied.
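A minimal sketch of the Phase 1 surface, assuming a TypeScript-style request/response; only the knob names and the env default come from this plan, while the type names, field layout, and `effectiveTokenBudget` helper are illustrative.

```ts
// Sketch of a Phase 1 recall_text request/response; the exact wire shape is an
// assumption, not the shipped contract.
interface RecallTextRequest {
  query: string;
  // New optional knobs; omitting both keeps pre-Phase-1 behavior.
  context_token_budget?: number;
  context_char_budget?: number;
}

interface RecallTextResponse {
  context: {
    text: string;         // the only field the compactor may shrink
    items: unknown[];     // untouched, so clients can still walk raw items
    citations: unknown[]; // untouched, so audit trails stay intact
  };
}

// Illustrative server-side guard: compaction runs only when a budget is set,
// either per request or via MEMORY_RECALL_TEXT_CONTEXT_TOKEN_BUDGET_DEFAULT.
function effectiveTokenBudget(req: RecallTextRequest, envDefault: number): number {
  const budget = req.context_token_budget ?? envDefault;
  return budget > 0 ? budget : 0; // 0 disables compaction
}
```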
Acceptance:
- Existing clients keep working with no request changes.
- With budget set, context text shrinks deterministically.
- Build/contract/docs/sdk checks pass.
Phase 2: Multi-Level Compression Strategy
Status: completed
Deliverables:
- Add section-level policy presets (`balanced`, `aggressive`).
- Prefer topic/concept and rule lines before event fanout under tight budgets (see the preset sketch after this list).
- Add compaction diagnostics in debug block (bounded metadata only).
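A sketch of how the presets might be expressed as section-priority policies; only the `balanced` and `aggressive` names come from this plan, while the `Section` values and `reservedForPriority` weights are assumptions.

```ts
// Illustrative compaction policy presets; section names and weights are
// placeholders, not the shipped configuration.
type Section = "rules" | "concepts" | "topics" | "events";

interface CompactionPolicy {
  // Sections are admitted in this order until the budget is exhausted.
  sectionPriority: Section[];
  // Fraction of the budget reserved for high-priority sections before any
  // event lines are admitted.
  reservedForPriority: number;
}

const PRESETS: Record<"balanced" | "aggressive", CompactionPolicy> = {
  balanced: {
    sectionPriority: ["rules", "concepts", "topics", "events"],
    reservedForPriority: 0.5,
  },
  aggressive: {
    sectionPriority: ["rules", "concepts", "topics", "events"],
    reservedForPriority: 0.8, // events get only the leftover budget
  },
};
```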
Acceptance:
- Compression ratio improves beyond Phase 1 on long contexts.
- No citation traceability regression.
Phase 3: Production Gate and Benchmark Standardization
Status: completed
Deliverables:
- Add a compression KPI check to `gate:core:prod` (non-blocking first, then blocking; see the gate sketch after this list).
- Add a production-style benchmark profile for compression.
- Keep LoCoMo/LongMemEval as auxiliary regression only.
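A hypothetical shape for the KPI gate step: the thresholds mirror the Success Metrics section, while the report path, JSON fields, and `COMPRESSION_GATE_BLOCKING` switch are assumptions for illustration.

```ts
// Hypothetical compression gate script for gate:core:prod.
import { readFileSync } from "node:fs";

interface KpiReport {
  context_compression_ratio: number;
  answer_quality_retain: number;
  recall_text_p95_regression: number; // e.g. 0.08 means +8%
}

// Assumed report location and blocking switch; names are placeholders.
const BLOCKING = process.env.COMPRESSION_GATE_BLOCKING === "1";
const report: KpiReport = JSON.parse(readFileSync("bench/compression-kpis.json", "utf8"));

const failures: string[] = [];
if (report.context_compression_ratio < 0.4) failures.push("compression ratio below 0.40");
if (report.answer_quality_retain < 0.95) failures.push("quality retention below 0.95");
if (report.recall_text_p95_regression > 0.1) failures.push("p95 regression above +10%");

if (failures.length > 0) {
  console.error(`compression gate: ${failures.join("; ")}`);
  // Non-blocking first: report only. Flip the switch to enforce after stabilization.
  process.exit(BLOCKING ? 1 : 0);
}
console.log("compression gate: all KPIs within thresholds");
```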
Acceptance:
- Release evidence includes compression KPIs.
- Gate fails when compression or quality thresholds are breached (after the stabilization period).
Risk and Guardrails
- Risk: over-compression hurts answer quality. Control: section-priority policy + quality-retain metric.
- Risk: unpredictable output shape. Control: compact only `context.text`; keep `items` and `citations` stable.
- Risk: per-model token mismatch. Control: use a conservative token estimator first (sketched below); calibrate by model family later.
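A sketch of a conservative estimator, assuming a rough ~4 chars/token heuristic with padding; the constants are placeholders to be replaced by per-model-family calibration.

```ts
// Conservative token estimate: ~4 chars/token is a common rough rule for English
// text, padded by 10% so budgets err on the side of over-counting tokens.
function estimateTokens(text: string): number {
  const roughTokens = Math.ceil(text.length / 4);
  return Math.ceil(roughTokens * 1.1);
}

// Example: a 4699-char context estimates to ~1293 tokens, so a 1000-token budget
// would trigger compaction while a 2000-token budget would not.
console.log(estimateTokens("x".repeat(4699)));
```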
Execution Checklist
- Implement Phase 1 code path and API schema.
- Update API contract and operator runbook.
- Align SDK input types.
- Run:
  - `npm run -s build`
  - `npm run -s test:contract`
  - `npm run -s docs:check`
  - `npm run -s sdk:release-check`
  - `npm run -s sdk:py:release-check`
- Ship and observe one release cycle before Phase 2.