Context Management
Skyflo applies four optimizations to keep per-turn token consumption predictable and prevent unbounded context growth in long conversations: on-demand toolset loading, prompt caching, conversation history windowing, and a memory context token budget. These operate at the Engine layer, below the agent's reasoning.
Token Budget at a Glance
| Component | Unoptimized | Optimization | Optimized |
|---|---|---|---|
| System prompt | ~2,200 tokens | Prompt caching | ~220 effective* |
| Tool schemas (all toolsets) | ~8,000 tokens | On-demand toolset loading | ~1,500 (k8s read-only + memory read) |
| User message | ~50–200 tokens | None | ~50–200 tokens |
| Memory context | Unbounded | Token budget cap | ≤ 2,200 tokens (configurable) |
| Conversation history | Unbounded | Windowing | ≤ last 40 messages (configurable) |
*Effective cost after provider-side prompt cache hits.
On-Demand Toolset Loading
The largest per-turn saving. Instead of sending all tool schemas on every request, the Engine exposes a virtual load_toolset tool that the LLM calls only when it determines a specific toolset is needed.
Default Context
When a conversation starts, the agent has access to:
- Kubernetes read-only tools (`k8s` tag, `include_write_tools=false`)
- Memory read tools (`memory_search`, `memory_read`, `memory_list`, `memory_history`)
- The virtual `load_toolset` tool itself
This default context costs approximately 1,500 tokens in tool schemas, versus ~8,000 tokens if all toolsets were loaded.
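For orientation, here is a minimal sketch of what the virtual tool's definition could look like as an OpenAI-style function schema; the descriptions and exact field layout are illustrative assumptions, not Skyflo's actual schema.

```python
# Illustrative only: an OpenAI-style function schema for the virtual load_toolset
# tool. Descriptions and exact field names are assumptions, not the real definition.
LOAD_TOOLSET_TOOL = {
    "type": "function",
    "function": {
        "name": "load_toolset",
        "description": (
            "Load an additional toolset into the conversation context. "
            "Call this before using any tool that is not currently available."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "toolset": {
                    "type": "string",
                    "enum": ["k8s", "helm", "argo", "jenkins", "memory"],
                    "description": "Tag of the toolset to load.",
                },
                "include_write_tools": {
                    "type": "boolean",
                    "description": "Also expose mutation tools for this toolset.",
                },
            },
            "required": ["toolset"],
        },
    },
}
```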
Available Toolsets
| Toolset | Tag | Default loaded |
|---|---|---|
| Kubernetes | k8s | Yes (read-only) |
| Helm | helm | No |
| Argo Rollouts | argo | No |
| Jenkins | jenkins | No |
| Memory (write tools) | memory | No |
How It Works
When the LLM determines it needs a toolset not currently loaded, it calls load_toolset:
```
LLM → load_toolset(toolset="helm", include_write_tools=false)
  Gate node intercepts (no MCP round-trip)
    → updates loaded_toolsets: {"k8s": false, "helm": false}
    → returns confirmation message to LLM
Next model turn → get_llm_compatible_tools() includes helm read-only schemas
LLM → helm_list_releases(...)
```
The include_write_tools parameter maps to the MCP readOnlyHint tool annotation. Setting it to true adds mutation tools for that toolset; setting it to false limits the LLM to read-only operations within that toolset.
The gate node handles load_toolset calls entirely in-process. No network call to the MCP server. No approval required.
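A sketch of how the gate node and schema filtering could fit together. Only load_toolset, loaded_toolsets, get_llm_compatible_tools, and readOnlyHint come from the description above; the handler name, state shape, and tool-record fields are assumptions for illustration.

```python
# Sketch only: in-process handling of load_toolset and readOnlyHint-aware schema
# filtering. Handler name, state shape, and tool-record fields are assumptions.
def handle_load_toolset(state: dict, args: dict) -> dict:
    """Gate node path: update loaded_toolsets in-process, no MCP call, no approval."""
    toolset = args["toolset"]
    include_write = args.get("include_write_tools", False)
    loaded = dict(state.get("loaded_toolsets", {"k8s": False}))
    loaded[toolset] = include_write          # value = write tools enabled?
    state["loaded_toolsets"] = loaded
    mode = "read/write" if include_write else "read-only"
    return {"role": "tool", "content": f"Toolset '{toolset}' loaded ({mode})."}

def get_llm_compatible_tools(all_tools: list[dict], loaded: dict[str, bool]) -> list[dict]:
    """Return only schemas for loaded toolsets; write tools only when requested."""
    visible = []
    for tool in all_tools:                   # assumed shape: {"tag", "read_only", "schema"}
        if tool["tag"] not in loaded:
            continue
        # read_only would be derived from the MCP readOnlyHint annotation.
        if tool["read_only"] or loaded[tool["tag"]]:
            visible.append(tool["schema"])
    return visible
```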
Examples
| Query | What the LLM does |
|---|---|
| Get pods in namespace prod | Uses `k8s_get` directly. No `load_toolset` call needed. |
| Show my helm releases | Calls `load_toolset(toolset="helm", include_write_tools=false)`, then `helm_list_releases`. |
| Scale deployment to 3 replicas | Calls `load_toolset(toolset="k8s", include_write_tools=true)`, then `k8s_scale` (pending approval). |
| Rollback the argo rollout | Calls `load_toolset(toolset="argo", include_write_tools=true)`, then `argo_undo` (pending approval). |
| Save this runbook to memory | Calls `load_toolset(toolset="memory", include_write_tools=true)`, then `memory_remember`. |
This avoids hardcoded keyword matching. The LLM uses its own understanding of the query to request what it needs.
load_toolset is sticky within a conversation turn sequence. Once Helm is loaded, it stays loaded for subsequent model turns in the same run. It does not persist across separate HTTP requests unless the LangGraph checkpoint restores loaded_toolsets state.
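One way this could be modeled so a LangGraph checkpointer can restore it; a sketch under that assumption, with every field other than loaded_toolsets invented for illustration.

```python
# Sketch: loaded_toolsets carried in the graph state so the checkpointer can
# restore it across requests. Only loaded_toolsets is named in the docs.
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    loaded_toolsets: dict[str, bool]   # e.g. {"k8s": False, "helm": False}
```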
Prompt Caching
Static content — the system prompt and tool schemas — is eligible for provider-side caching. This is transparent to the agent but significantly reduces effective input token cost for repeated turns.
OpenAI
Automatic caching for requests with 1,024 or more prompt tokens. No configuration required. Cache hits appear in prompt_tokens_details.cached_tokens in the token.usage SSE event.
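A small sketch of where that field surfaces when calling the Chat Completions API with the OpenAI Python SDK; the prompt content is a placeholder.

```python
# Sketch, assuming the OpenAI Python SDK: the cached-token count appears under
# usage.prompt_tokens_details once the prompt exceeds the 1,024-token threshold.
from openai import OpenAI

SYSTEM_PROMPT = "..."  # placeholder; must push the prompt past ~1,024 tokens to cache

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Get pods in namespace prod"},
    ],
)
details = response.usage.prompt_tokens_details
print("cached prompt tokens:", details.cached_tokens if details else 0)
```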
Anthropic (Native, Bedrock, Vertex AI)
LiteLLM cache_control_injection_points marks the system message and tool configuration as cacheable. After the first turn in a conversation, subsequent calls pay approximately 10% of the original input token cost for cached blocks. Applies uniformly across the native Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.
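A sketch of the LiteLLM call, assuming the cache_control_injection_points keyword takes a list of injection-point dicts; check the LiteLLM docs for the exact shape on your version. The system prompt is a placeholder.

```python
# Sketch: marking the system message cacheable via LiteLLM's
# cache_control_injection_points. The injection-point shape follows the LiteLLM
# docs as I understand them; verify against the LiteLLM version in use.
import litellm

SYSTEM_PROMPT = "..."  # placeholder

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Show my helm releases"},
    ],
    cache_control_injection_points=[
        {"location": "message", "role": "system"},
    ],
)
print(response.usage)  # cache creation/read token counts appear here after turn 1
```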
Cache hit rate is tracked in token.usage events and persisted to the conversation record.
Conversation History Windowing
Before each LLM call, the message list is windowed to prevent unbounded token growth in long multi-turn conversations.
Rules
- System messages: Always preserved.
- First user message: Always preserved. This is the intent anchor — the original request that scopes the whole workflow.
- Last N messages: Kept. Configurable via `LLM_CONTEXT_WINDOW_MESSAGES` (default 40).
- Messages outside the window: Dropped before the LLM call. Not deleted from the checkpoint.
- Orphaned tool messages: Cleaned up. Tool results without their preceding assistant `tool_calls` message would confuse the model; they are removed from the windowed slice.
Without windowing, per-turn token cost grows linearly with conversation length, so cumulative cost across a conversation grows O(N²). With a window of 40 messages, per-turn cost is effectively constant for any conversation beyond ~20 turns.
Windowing drops messages from the LLM's view, not from the checkpoint. Full conversation history is always preserved in the database. The agent reasons over a window; the audit trail is complete.
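A compact sketch of these rules, using hypothetical function and field names and assuming OpenAI-style message dicts:

```python
# Sketch of the windowing rules above; names and message shape are assumptions.
def window_messages(messages: list[dict], max_messages: int = 40) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]        # always kept
    rest = [m for m in messages if m["role"] != "system"]

    tail = rest[-max_messages:]                                    # last N messages
    first_user = next((m for m in rest if m["role"] == "user"), None)
    if first_user is not None and first_user not in tail:          # intent anchor
        tail = [first_user] + tail

    # Drop orphaned tool results whose assistant tool_calls message fell outside
    # the window; they would confuse the model.
    call_ids = {tc["id"]
                for m in tail if m["role"] == "assistant"
                for tc in (m.get("tool_calls") or [])}
    windowed = [m for m in tail
                if m["role"] != "tool" or m.get("tool_call_id") in call_ids]
    return system + windowed
```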
Memory Context Budget
The memory_prepare node retrieves documents from Postgres and injects them before each planning turn. The retrieval pipeline over-fetches by 3× and trims to the MEMORY_CONTEXT_TOKEN_BUDGET (default 2,200 tokens) after re-ranking.
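A sketch of the trim step, using hypothetical helper names; only the 3× over-fetch and the 2,200-token budget come from the description above.

```python
# Sketch only: trim re-ranked memory documents to the token budget. count_tokens
# is a stand-in for the real tokenizer; search/rerank below are hypothetical.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)          # rough approximation for illustration

def trim_to_budget(ranked_docs: list[str], budget_tokens: int = 2200) -> list[str]:
    kept, used = [], 0
    for doc in ranked_docs:                # highest-ranked first
        cost = count_tokens(doc)
        if used + cost > budget_tokens:
            break
        kept.append(doc)
        used += cost
    return kept

# docs = search(query, limit=3 * top_k)            # over-fetch by 3x ...
# context = trim_to_budget(rerank(query, docs))    # ... then trim after re-ranking
```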
This budget interacts with the other optimizations. A typical per-turn context footprint with all optimizations active:
```
System prompt (cached):       ~220 tokens effective
Tool schemas (k8s default): ~1,500 tokens
Memory context:             ≤ 2,200 tokens
User message:                 ~100 tokens (typical)
─────────────────────────────────────────────────
Total per new turn:         ~4,020 tokens effective
```
Compare to the unoptimized case (no caching, all schemas loaded, no memory budget):
```
System prompt (no cache):   ~2,200 tokens
Tool schemas (all):         ~8,000 tokens
Memory context (no cap):    unbounded
User message:                 ~100 tokens
```
Configuration Reference
| Setting | Default | Purpose |
|---|---|---|
| `LLM_CONTEXT_WINDOW_MESSAGES` | 40 | Max messages kept in context per LLM call |
| `MEMORY_CONTEXT_TOKEN_BUDGET` | 2200 | Max tokens injected from memory per turn |
| `MEMORY_ENABLED` | true | Toggle memory context injection on/off |
| `LLM_MAX_ITERATIONS` | 25 | Graph recursion limit (bounds total tool call rounds per request) |
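As an illustration, these settings could be read from the environment with a pydantic-settings object; this sketch assumes that style and is not Skyflo's actual configuration code.

```python
# Sketch only: environment-backed settings in pydantic-settings style; defaults
# mirror the table above. Not Skyflo's actual configuration module.
from pydantic_settings import BaseSettings

class ContextSettings(BaseSettings):
    LLM_CONTEXT_WINDOW_MESSAGES: int = 40
    MEMORY_CONTEXT_TOKEN_BUDGET: int = 2200
    MEMORY_ENABLED: bool = True
    LLM_MAX_ITERATIONS: int = 25

settings = ContextSettings()   # e.g. MEMORY_ENABLED=false in the env overrides the default
```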
