Context Management
Skyflo applies four optimizations to keep per-turn token consumption predictable and prevent unbounded context growth in long conversations: on-demand toolset loading, prompt caching, conversation history windowing, and a memory context token budget. These operate at the Engine layer, below the agent's reasoning.
Token Budget at a Glance
| Component | Unoptimized | Optimization | Optimized |
|---|---|---|---|
| System prompt | ~2,200 tokens | Prompt caching | ~220 effective* |
| Tool schemas (all toolsets) | ~8,000 tokens | On-demand toolset loading | ~1,500 (k8s read-only + memory read) |
| User message | ~50–200 tokens | None | ~50–200 tokens |
| Memory context | Unbounded | Token budget cap | ≤ 2,200 tokens (configurable) |
| Conversation history | Unbounded | Windowing | ≤ last 40 messages (configurable) |
*Effective cost after provider-side prompt cache hits.
On-Demand Toolset Loading
The largest per-turn saving. Instead of sending all tool schemas on every request, the Engine exposes a virtual load_toolset tool that the LLM calls only when it determines a specific toolset is needed.
Default Context
When a conversation starts, the agent has access to:
- Kubernetes read-only tools (`k8s` tag, `include_write_tools=false`)
- Memory read tools (`memory_search`, `memory_read`, `memory_list`, `memory_history`)
- The virtual `load_toolset` tool itself
This default context costs approximately 1,500 tokens in tool schemas, versus ~8,000 tokens if all toolsets were loaded.
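For orientation, here is a minimal sketch of what the virtual tool's definition could look like as an OpenAI-style function schema; the descriptions and exact field layout are illustrative assumptions, not Skyflo's actual schema.

```python
# Illustrative only: an OpenAI-style function schema for the virtual load_toolset
# tool. Descriptions and exact field names are assumptions, not the real definition.
LOAD_TOOLSET_TOOL = {
    "type": "function",
    "function": {
        "name": "load_toolset",
        "description": (
            "Load an additional toolset into the conversation context. "
            "Call this before using any tool that is not currently available."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "toolset": {
                    "type": "string",
                    "enum": ["k8s", "helm", "argo", "jenkins", "memory"],
                    "description": "Tag of the toolset to load.",
                },
                "include_write_tools": {
                    "type": "boolean",
                    "description": "Also expose mutation tools for this toolset.",
                },
            },
            "required": ["toolset"],
        },
    },
}
```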
Available Toolsets
| Toolset | Tag | Default loaded |
|---|---|---|
| Kubernetes | k8s | Yes (read-only) |
| Helm | helm | No |
| Argo Rollouts | argo | No |
| Jenkins | jenkins | No |
| Memory (write tools) | memory | No |
How It Works
When the LLM determines it needs a toolset not currently loaded, it calls load_toolset:
```
LLM → load_toolset(toolset="helm", include_write_tools=false)
  Gate node intercepts (no MCP round-trip)
    → updates loaded_toolsets: {"k8s": false, "helm": false}
    → returns confirmation message to LLM
Next model turn → get_llm_compatible_tools() includes helm read-only schemas
LLM → helm_list_releases(...)
```
The include_write_tools parameter maps to the MCP readOnlyHint tool annotation. Setting it to true adds mutation tools for that toolset; setting it to false limits the LLM to read-only operations within that toolset.
The gate node handles load_toolset calls entirely in-process. No network call to the MCP server. No approval required.
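A sketch of how the gate node and schema filtering could fit together. Only load_toolset, loaded_toolsets, get_llm_compatible_tools, and readOnlyHint come from the description above; the handler name, state shape, and tool-record fields are assumptions for illustration.

```python
# Sketch only: in-process handling of load_toolset and readOnlyHint-aware schema
# filtering. Handler name, state shape, and tool-record fields are assumptions.
def handle_load_toolset(state: dict, args: dict) -> dict:
    """Gate node path: update loaded_toolsets in-process, no MCP call, no approval."""
    toolset = args["toolset"]
    include_write = args.get("include_write_tools", False)
    loaded = dict(state.get("loaded_toolsets", {"k8s": False}))
    loaded[toolset] = include_write          # value = write tools enabled?
    state["loaded_toolsets"] = loaded
    mode = "read/write" if include_write else "read-only"
    return {"role": "tool", "content": f"Toolset '{toolset}' loaded ({mode})."}

def get_llm_compatible_tools(all_tools: list[dict], loaded: dict[str, bool]) -> list[dict]:
    """Return only schemas for loaded toolsets; write tools only when requested."""
    visible = []
    for tool in all_tools:                   # assumed shape: {"tag", "read_only", "schema"}
        if tool["tag"] not in loaded:
            continue
        # read_only would be derived from the MCP readOnlyHint annotation.
        if tool["read_only"] or loaded[tool["tag"]]:
            visible.append(tool["schema"])
    return visible
```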
Examples
| Query | What the LLM does |
|---|---|
| Get pods in namespace prod | Uses `k8s_get` directly. No `load_toolset` call needed. |
| Show my helm releases | Calls `load_toolset(toolset="helm", include_write_tools=false)`, then `helm_list_releases`. |
| Scale deployment to 3 replicas | Calls `load_toolset(toolset="k8s", include_write_tools=true)`, then `k8s_scale` (pending approval). |
| Rollback the argo rollout | Calls `load_toolset(toolset="argo", include_write_tools=true)`, then `argo_undo` (pending approval). |
| Save this runbook to memory | Calls `load_toolset(toolset="memory", include_write_tools=true)`, then `memory_remember`. |
This avoids hardcoded keyword matching. The LLM uses its own understanding of the query to request what it needs.
load_toolset is sticky within a conversation turn sequence. Once Helm is loaded, it stays loaded for subsequent model turns in the same run. It does not persist across separate HTTP requests unless the LangGraph checkpoint restores loaded_toolsets state.
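One way this could be modeled so a LangGraph checkpointer can restore it; a sketch under that assumption, with every field other than loaded_toolsets invented for illustration.

```python
# Sketch: loaded_toolsets carried in the graph state so the checkpointer can
# restore it across requests. Only loaded_toolsets is named in the docs.
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    loaded_toolsets: dict[str, bool]   # e.g. {"k8s": False, "helm": False}
```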
Prompt Caching
Static content — the system prompt and tool schemas — is eligible for provider-side caching. This is transparent to the agent but significantly reduces effective input token cost for repeated turns.
OpenAI
Automatic caching for requests with 1,024 or more prompt tokens. No configuration required. Cache hits appear in prompt_tokens_details.cached_tokens in the token.usage SSE event.
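A small sketch of where that field surfaces when calling the Chat Completions API with the OpenAI Python SDK; the prompt content is a placeholder.

```python
# Sketch, assuming the OpenAI Python SDK: the cached-token count appears under
# usage.prompt_tokens_details once the prompt exceeds the 1,024-token threshold.
from openai import OpenAI

SYSTEM_PROMPT = "..."  # placeholder; must push the prompt past ~1,024 tokens to cache

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Get pods in namespace prod"},
    ],
)
details = response.usage.prompt_tokens_details
print("cached prompt tokens:", details.cached_tokens if details else 0)
```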
Anthropic (Native, Bedrock, Vertex AI)
LiteLLM cache_control_injection_points marks the system message and tool configuration as cacheable. After the first turn in a conversation, subsequent calls pay approximately 10% of the original input token cost for cached blocks. Applies uniformly across the native Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.
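A sketch of the LiteLLM call, assuming the cache_control_injection_points keyword takes a list of injection-point dicts; check the LiteLLM docs for the exact shape on your version. The system prompt is a placeholder.

```python
# Sketch: marking the system message cacheable via LiteLLM's
# cache_control_injection_points. The injection-point shape follows the LiteLLM
# docs as I understand them; verify against the LiteLLM version in use.
import litellm

SYSTEM_PROMPT = "..."  # placeholder

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Show my helm releases"},
    ],
    cache_control_injection_points=[
        {"location": "message", "role": "system"},
    ],
)
print(response.usage)  # cache creation/read token counts appear here after turn 1
```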
Cache hit rate is tracked in token.usage events and persisted to the conversation record.
Conversation History Windowing
Before each LLM call, the message list is windowed to prevent unbounded token growth in long multi-turn conversations.
Rules
- System messages: Always preserved.
- First user message: Always preserved. This is the intent anchor — the original request that scopes the whole workflow.
- Last N messages: Kept. Configurable via `LLM_CONTEXT_WINDOW_MESSAGES` (default 40).
- Messages outside the window: Dropped before the LLM call. Not deleted from the checkpoint.
- Orphaned tool messages: Cleaned up. Tool results without their preceding assistant `tool_calls` message would confuse the model; they are removed from the windowed slice.
Without windowing, per-turn token cost grows linearly with conversation length, so cumulative cost across a conversation grows O(N²). With a window of 40 messages, per-turn cost is effectively constant for any conversation beyond ~20 turns.
Windowing drops messages from the LLM's view, not from the checkpoint. Full conversation history is always preserved in the database. The agent reasons over a window; the audit trail is complete.
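A compact sketch of these rules, using hypothetical function and field names and assuming OpenAI-style message dicts:

```python
# Sketch of the windowing rules above; names and message shape are assumptions.
def window_messages(messages: list[dict], max_messages: int = 40) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]        # always kept
    rest = [m for m in messages if m["role"] != "system"]

    tail = rest[-max_messages:]                                    # last N messages
    first_user = next((m for m in rest if m["role"] == "user"), None)
    if first_user is not None and first_user not in tail:          # intent anchor
        tail = [first_user] + tail

    # Drop orphaned tool results whose assistant tool_calls message fell outside
    # the window; they would confuse the model.
    call_ids = {tc["id"]
                for m in tail if m["role"] == "assistant"
                for tc in (m.get("tool_calls") or [])}
    windowed = [m for m in tail
                if m["role"] != "tool" or m.get("tool_call_id") in call_ids]
    return system + windowed
```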
Memory Context Budget
The memory_prepare node retrieves documents from Postgres and injects them before each planning turn. The retrieval pipeline over-fetches by 3× and trims to the MEMORY_CONTEXT_TOKEN_BUDGET (default 2,200 tokens) after re-ranking.
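A sketch of the trim step, using hypothetical helper names; only the 3× over-fetch and the 2,200-token budget come from the description above.

```python
# Sketch only: trim re-ranked memory documents to the token budget. count_tokens
# is a stand-in for the real tokenizer; search/rerank below are hypothetical.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)          # rough approximation for illustration

def trim_to_budget(ranked_docs: list[str], budget_tokens: int = 2200) -> list[str]:
    kept, used = [], 0
    for doc in ranked_docs:                # highest-ranked first
        cost = count_tokens(doc)
        if used + cost > budget_tokens:
            break
        kept.append(doc)
        used += cost
    return kept

# docs = search(query, limit=3 * top_k)            # over-fetch by 3x ...
# context = trim_to_budget(rerank(query, docs))    # ... then trim after re-ranking
```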
This budget interacts with the other optimizations. A typical per-turn context footprint with all optimizations active:
```
System prompt (cached):       ~220 tokens effective
Tool schemas (k8s default): ~1,500 tokens
Memory context:             ≤ 2,200 tokens
User message:                 ~100 tokens (typical)
─────────────────────────────────────────────────
Total per new turn:         ~4,020 tokens effective
```
Compare to the unoptimized case (no caching, all schemas loaded, no memory budget):
```
System prompt (no cache):   ~2,200 tokens
Tool schemas (all):         ~8,000 tokens
Memory context (no cap):    unbounded
User message:                 ~100 tokens
```
Configuration Reference
| Setting | Default | Purpose |
|---|---|---|
| `LLM_CONTEXT_WINDOW_MESSAGES` | 40 | Max messages kept in context per LLM call |
| `MEMORY_CONTEXT_TOKEN_BUDGET` | 2200 | Max tokens injected from memory per turn |
| `MEMORY_ENABLED` | true | Toggle memory context injection on/off |
| `LLM_MAX_ITERATIONS` | 25 | Graph recursion limit (bounds total tool call rounds per request) |
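As an illustration, these settings could be read from the environment with a pydantic-settings object; this sketch assumes that style and is not Skyflo's actual configuration code.

```python
# Sketch only: environment-backed settings in pydantic-settings style; defaults
# mirror the table above. Not Skyflo's actual configuration module.
from pydantic_settings import BaseSettings

class ContextSettings(BaseSettings):
    LLM_CONTEXT_WINDOW_MESSAGES: int = 40
    MEMORY_CONTEXT_TOKEN_BUDGET: int = 2200
    MEMORY_ENABLED: bool = True
    LLM_MAX_ITERATIONS: int = 25

settings = ContextSettings()   # e.g. MEMORY_ENABLED=false in the env overrides the default
```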
