Engine
The Engine is the core agent runtime. It runs a LangGraph workflow, orchestrates LLM calls via LiteLLM, injects memory context before each planning turn, executes MCP tools with approval policy, and streams all events to the Command Center over SSE.
LangGraph Workflow
The graph has five nodes: entry → memory_prepare → model → gate → final, with conditional routing between them.
START --> [entry] --+--> [memory_prepare] --> [model] ---+--> [gate] --+--> [final] --> END
| | |
+-- if pending tools +-- if tool +-- if awaiting
(skip memory_prepare) --> [gate] calls approval
|
+-- if no tool calls --> [final]
Node Responsibilities
entry — Initialization and reset.
- Clears
awaiting_approvalif resuming from an approval pause. - Resets
memory_context_loadedso retrieval runs fresh on each new user turn. - Checks for cooperative stop (
StopRequested). - Routes to
gateifpending_toolsexist (approval resume path). Otherwise routes tomemory_prepare.
memory_prepare — Context injection before the LLM turn.
- Skipped if
MEMORY_ENABLED=falseormemory_context_loadedis alreadyTrue(prevents double injection on resume paths). - Extracts the latest user message as the retrieval query.
- Runs Postgres full-text search across accessible stores (workspace, user, conversation), re-ranks by trust score, document type boost, entity match, and recency decay.
- Truncates results to
MEMORY_CONTEXT_TOKEN_BUDGET(default 2,200 tokens). - Formats trusted hits as
<skyflo_memory_context>and draft hits as<skyflo_untrusted_memory_candidates>. Appends both as a system message before the LLM sees the user turn. - Emits
memory.context.loadedSSE event with document IDs, store slugs, paths, and trust levels. - On any error: logs a warning and continues to model without blocking. Memory failures never halt the workflow.
model — LLM inference via LiteLLM.
- Loads tool schemas filtered by
loaded_toolsetsstate. The virtualload_toolsettool and memory read tools are always included. Memory write tools requireloaded_toolsets["memory"] == True. - Applies message windowing before each call. See Context Management.
- Streams completion tokens, emitting
token,thinking,ttft,token.usage, andgeneration.completeSSE events. - Uses native stop semantics: tool calls produced → route to
gate; no tool calls → route tofinal.
gate — Tool execution with approval enforcement.
- Intercepts
load_toolsetcalls locally (no MCP round-trip). Updatesloaded_toolsetsstate and returns a confirmation. Next model turn receives the expanded schema set. - Intercepts memory tool calls and dispatches to
MemoryVirtualToolExecutor(bypasses MCP approval gate). - Iterates over remaining
pending_tools, callingToolExecutor.execute()for each. - On
ApprovalPending: returns completed tool results so far, keeps remaining tools inpending_tools, setsawaiting_approval: True, routes tofinal(graph pauses). - On success: clears
pending_tools, appends tool result messages, routes tomodel.
final — Completion.
- Calculates duration.
- Emits
completedSSE event with timing. - Sets
done: Trueand routes toEND.
Stop Condition
The loop uses native stop semantics, matching the pattern used by Anthropic Claude SDK, OpenAI Agents SDK, and LangGraph reference implementations:
- Tool calls produced → route to
gate - No tool calls (text-only response) → route to
final
No separate LLM call is made to determine whether to stop. This eliminates the 1–4 second latency that a judge call would add after every final response.
Context Management
The Engine applies four optimizations to keep per-turn token consumption predictable.
On-Demand Toolset Loading
Instead of sending all tool schemas on every turn, the Engine exposes a virtual load_toolset tool that the LLM calls when it needs a specific toolset.
Default context: k8s read-only tools + memory read tools + load_toolset (~1,500 tokens of tool schemas).
Without optimization: all tool schemas would consume ~8,000 tokens every turn.
The load_toolset tool accepts toolset (k8s, helm, argo, jenkins, memory) and include_write_tools. The gate node intercepts calls locally with no MCP round-trip, updates loaded_toolsets in agent state, and the next model turn receives the expanded schema set.
See Context Management for the full flow and examples.
Prompt Caching
Static content (system prompt and tool schemas) is eligible for provider-side caching.
- OpenAI: Automatic caching for requests with 1,024+ prompt tokens.
- Anthropic (native, Bedrock, Vertex AI): LiteLLM
cache_control_injection_pointsmarks system message and tool config as cacheable. Subsequent turns pay ~10% of the original input cost for cached blocks.
Cache hit rate is visible in the token.usage SSE event via prompt_tokens_details.cached_tokens.
Conversation History Windowing
Before each LLM call, messages are windowed to prevent O(N²) token cost growth:
- System messages: always preserved
- First user message (intent anchor): always preserved
- Last N messages: kept (default 40, configurable via
LLM_CONTEXT_WINDOW_MESSAGES) - Orphaned tool messages (tool results without their preceding
tool_calls): cleaned up
Memory Context Budget
Memory documents injected by memory_prepare are capped at MEMORY_CONTEXT_TOKEN_BUDGET (default 2,200 tokens). Retrieval over-fetches by 3× and trims to budget after re-ranking.
Memory System
The Engine integrates a persistent memory system that gives the agent durable operational context across sessions. See Memory for the full reference.
At the graph level:
memory_prepareruns on every new user turn before the model node.- Memory tool calls in
gatebypass the MCP approval gate and go toMemoryVirtualToolExecutor. - Memory writes are safety-scanned and policy-checked before being persisted.
- The
memory.context.loadedSSE event surfaces what was injected, including trust levels.
SSE Streaming
All events stream from POST /api/v1/agent/chat. Event types:
| Event | Source | Description |
|---|---|---|
ready | SSE generator | Stream initialized with run_id |
ttft | Model node | Time-to-first-token measurement |
thinking | Model node | Reasoning token stream |
thinking.complete | Model node | Full thinking content with duration |
token | Model node | Content token stream |
token.usage | Model node | Token counts and cost (includes cached_tokens) |
generation.start | Model node | LLM call initiated |
generation.complete | Model node | LLM call finished |
tools.pending | Gate node | Tool calls identified, with requires_approval per tool |
tool.awaiting_approval | Tool executor | Mutating tool paused for approval |
tool.approved | Tool executor | Tool approved |
tool.denied | Tool executor | Tool denied |
tool.executing | Tool executor | Tool execution started |
tool.result | Tool executor | Tool execution completed |
tool.error | Tool executor | Tool execution failed |
rate_limit | Model node | Rate limit hit, retrying with backoff |
transient_error | Model node | Transient error, retrying |
memory.context.loaded | Memory prepare | Documents injected; includes IDs, store slugs, paths, trust levels |
memory.search | Memory executor | Agent called memory_search; includes query and result count |
memory.write.created | Memory executor | Agent wrote a memory document |
memory.write.blocked | Memory executor | Write blocked by safety scanner |
memory.policy.denied | Memory executor | Write blocked by policy engine |
memory.promotion.proposed | Memory executor | Agent proposed a draft for admin promotion to workspace |
memory.safety.flagged | Memory executor | Safety scanner found findings during a write attempt |
completed | Final node | Workflow completed with timing |
workflow.error | Graph invoke | Unrecoverable error |
workflow_complete | Endpoint | Terminal event for SSE cleanup |
heartbeat | SSE generator | Keepalive (60s idle timeout) |
conversation.title.generated | Endpoint | Auto-generated conversation title |
Memory events are observable in the Command Center UI. They are not part of tools.pending display.
LLM Integration
The Engine uses LiteLLM as a model-agnostic abstraction layer.
| Setting | Default | Purpose |
|---|---|---|
LLM_MODEL | — | Model identifier (openai/gpt-4o, anthropic/claude-sonnet-4-20250514, etc.) |
LLM_HOST | — | Optional custom API base URL |
LLM_MAX_ITERATIONS | 25 | Recursion limit for graph execution |
LLM_CONTEXT_WINDOW_MESSAGES | 40 | Max messages kept in context per LLM call |
LLM_THINKING_BUDGET_TOKENS | — | Explicit thinking token budget |
LLM_REASONING_EFFORT | high | Reasoning effort level (low, medium, high, default) |
LLM_MAX_TOKENS | — | Max output tokens |
Reasoning Support
The Engine auto-detects reasoning model support via litellm.supports_reasoning(). Configuration priority:
- Explicit
LLM_THINKING_BUDGET_TOKENSenv var - Explicit
LLM_REASONING_EFFORTenv var - Auto-detection (defaults to
reasoning_effort: "high") - Disabled (safe default for unknown models)
Retry Logic
| Error Type | Behavior |
|---|---|
RateLimitError | Exponential backoff up to 60s, max 3 retries |
| Transient (timeout, 502/503/504) | Exponential backoff up to 30s, max 3 retries |
| Other | Immediate failure |
Persistence
The create_event_callback function dual-writes to both Redis (for SSE) and ConversationPersistenceService (database). Persisted data includes:
- Token usage and cost per turn
- Time-to-first-token (TTFT) and time-to-response (TTR)
- Thinking segments
- Text segments
- Tool execution segments (status: pending → executing → completed / error / denied)
- Memory usage records (
conversation_memory_usage— which docs were injected per run)
Authentication
JWT with refresh token rotation. HttpOnly cookies for session. First user to register becomes admin. Conversation access requires owner match or superuser.
Cooperative Stop
Users can stop a running workflow via POST /api/v1/agent/stop. A Redis flag is set. Every 25 streaming tokens and on each node entry, the Engine checks this flag. On detection, StopRequested propagates, the invoke() caller emits completed with status: "stopped", and the flag is cleared.
