Engine

The Engine is the core agent runtime. It runs a LangGraph workflow, orchestrates LLM calls via LiteLLM, injects memory context before each planning turn, executes MCP tools with approval policy, and streams all events to the Command Center over SSE.

LangGraph Workflow

The graph has five nodes: entry → memory_prepare → model → gate → final, with conditional routing between them.

START --> [entry] --+--> [memory_prepare] --> [model] ---+--> [gate] --+--> [final] --> END
                    |                                    |             |
                    +-- if pending tools                 +-- if tool   +-- if awaiting
                        (skip memory_prepare) --> [gate]     calls         approval
                                                         |
                                                         +-- if no tool calls --> [final]
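
The conditional routing in the diagram can be sketched as three pure functions over the agent state. This is a minimal illustration, not the Engine's actual code: `pending_tools` and `awaiting_approval` are state keys named in this document, while the `tool_calls` key and the function names are assumptions.

```python
from typing import Any, Dict

State = Dict[str, Any]


def route_after_entry(state: State) -> str:
    """Approval-resume path skips memory_prepare and goes straight to gate."""
    return "gate" if state.get("pending_tools") else "memory_prepare"


def route_after_model(state: State) -> str:
    """Native stop semantics: tool calls continue the loop, plain text ends it."""
    return "gate" if state.get("tool_calls") else "final"


def route_after_gate(state: State) -> str:
    """Pause at final while approval is outstanding, otherwise loop back to model."""
    return "final" if state.get("awaiting_approval") else "model"
```

In LangGraph, functions of this shape are typically registered with `add_conditional_edges` on the graph builder.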

Node Responsibilities

entry — Initialization and reset.

Clears awaiting_approval if resuming from an approval pause.
Resets memory_context_loaded so retrieval runs fresh on each new user turn.
Checks for cooperative stop (StopRequested).
Routes to gate if pending_tools exist (approval resume path). Otherwise routes to memory_prepare.

memory_prepare — Context injection before the LLM turn.

Skipped if MEMORY_ENABLED=false or memory_context_loaded is already True (prevents double injection on resume paths).
Extracts the latest user message as the retrieval query.
Runs Postgres full-text search across accessible stores (workspace, user, conversation), re-ranks by trust score, document type boost, entity match, and recency decay.
Truncates results to MEMORY_CONTEXT_TOKEN_BUDGET (default 2,200 tokens).
Formats trusted hits as <skyflo_memory_context> and draft hits as <skyflo_untrusted_memory_candidates>. Appends both as a system message before the LLM sees the user turn.
Emits memory.context.loaded SSE event with document IDs, store slugs, paths, and trust levels.
On any error: logs a warning and continues to model without blocking. Memory failures never halt the workflow.
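
To make the re-ranking step concrete, the sketch below combines the four signals listed above into a single score. The field names, weights, type boosts, and the 30-day half-life are all hypothetical; the Engine's actual formula is not specified here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Hit:
    doc_id: str
    text_rank: float              # full-text search rank from Postgres
    trust: float                  # 0.0 (draft) .. 1.0 (trusted); assumed scale
    doc_type: str
    entities: Set[str] = field(default_factory=set)
    age_days: float = 0.0


# Hypothetical values, chosen for illustration only.
TYPE_BOOST = {"runbook": 0.3, "incident": 0.2}
HALF_LIFE_DAYS = 30.0


def score(hit: Hit, query_entities: Set[str]) -> float:
    base = hit.text_rank * (0.5 + 0.5 * hit.trust)        # trust scales the text rank
    entity_bonus = 0.25 * len(hit.entities & query_entities)
    recency = 0.5 ** (hit.age_days / HALF_LIFE_DAYS)      # exponential recency decay
    return (base + TYPE_BOOST.get(hit.doc_type, 0.0) + entity_bonus) * recency


def rerank(hits: List[Hit], query_entities: Set[str]) -> List[Hit]:
    """Return hits ordered from most to least relevant."""
    return sorted(hits, key=lambda h: score(h, query_entities), reverse=True)
```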

model — LLM inference via LiteLLM.

Loads tool schemas filtered by loaded_toolsets state. The virtual load_toolset tool and memory read tools are always included. Memory write tools require loaded_toolsets["memory"] == True.
Applies message windowing before each call. See Context Management.
Streams completion tokens, emitting token, thinking, ttft, token.usage, and generation.complete SSE events.
Uses native stop semantics: tool calls produced → route to gate; no tool calls → route to final.
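
The schema filtering rule can be sketched as follows. The per-tool `toolset` and `write` fields and the `memory_write` tool name in the usage example are assumptions for illustration; the handling of write tools in non-memory toolsets is omitted.

```python
from typing import Any, Dict, List

Tool = Dict[str, Any]


def select_tool_schemas(all_tools: List[Tool], loaded_toolsets: Dict[str, bool]) -> List[Tool]:
    """Pick the tool schemas to send to the model on this turn."""
    selected: List[Tool] = []
    for tool in all_tools:
        if tool["name"] == "load_toolset":
            selected.append(tool)  # virtual tool, always available
        elif tool["toolset"] == "memory":
            # Memory reads are always on; writes need the memory toolset loaded.
            if not tool["write"] or loaded_toolsets.get("memory") is True:
                selected.append(tool)
        elif loaded_toolsets.get(tool["toolset"]):
            selected.append(tool)
    return selected


TOOLS = [
    {"name": "load_toolset", "toolset": "virtual", "write": False},
    {"name": "memory_search", "toolset": "memory", "write": False},
    {"name": "memory_write", "toolset": "memory", "write": True},  # hypothetical name
    {"name": "helm_list", "toolset": "helm", "write": False},      # hypothetical name
]
print([t["name"] for t in select_tool_schemas(TOOLS, {"helm": True})])
```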

gate — Tool execution with approval enforcement.

Intercepts load_toolset calls locally (no MCP round-trip). Updates loaded_toolsets state and returns a confirmation. Next model turn receives the expanded schema set.
Intercepts memory tool calls and dispatches to MemoryVirtualToolExecutor (bypasses MCP approval gate).
Iterates over remaining pending_tools, calling ToolExecutor.execute() for each.
On ApprovalPending: returns completed tool results so far, keeps remaining tools in pending_tools, sets awaiting_approval: True, routes to final (graph pauses).
On success: clears pending_tools, appends tool result messages, routes to model.
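
The pause-and-resume behavior of the execution loop can be sketched like this. `ApprovalPending` here is a local stand-in for the Engine's signal, `execute` stands in for ToolExecutor.execute(), and the `next` key is a hypothetical way to show the routing decision.

```python
from typing import Any, Callable, Dict, List

Call = Dict[str, Any]


class ApprovalPending(Exception):
    """Stand-in for the signal raised when a mutating tool needs approval."""


def run_gate(state: Dict[str, Any], execute: Callable[[Call], Dict[str, Any]]) -> Dict[str, Any]:
    pending: List[Call] = list(state["pending_tools"])  # copy; do not mutate input
    results: List[Dict[str, Any]] = []
    while pending:
        try:
            results.append(execute(pending[0]))
        except ApprovalPending:
            # Keep the unapproved call and everything after it for the resume path.
            return {**state, "messages": state["messages"] + results,
                    "pending_tools": pending, "awaiting_approval": True,
                    "next": "final"}
        pending.pop(0)
    # Every tool ran: clear the queue and hand the results back to the model.
    return {**state, "messages": state["messages"] + results,
            "pending_tools": [], "awaiting_approval": False, "next": "model"}
```

Because the paused state still holds the unapproved call at the head of `pending_tools`, re-entering the graph after approval resumes exactly where execution stopped.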

final — Completion.

Calculates duration.
Emits completed SSE event with timing.
Sets done: True and routes to END.

Stop Condition

The loop uses native stop semantics, matching the pattern used by the Anthropic Claude SDK, the OpenAI Agents SDK, and the LangGraph reference implementations:

Tool calls produced → route to gate
No tool calls (text-only response) → route to final

No separate LLM call is made to determine whether to stop. This eliminates the 1–4 second latency that a judge call would add after every final response.

Context Management

The Engine applies four optimizations to keep per-turn token consumption predictable.

On-Demand Toolset Loading

Instead of sending all tool schemas on every turn, the Engine exposes a virtual load_toolset tool that the LLM calls when it needs a specific toolset.

Default context: k8s read-only tools + memory read tools + load_toolset (~1,500 tokens of tool schemas).

Without optimization: all tool schemas would consume ~8,000 tokens every turn.

The load_toolset tool accepts toolset (k8s, helm, argo, jenkins, memory) and include_write_tools. The gate node intercepts calls locally with no MCP round-trip, updates loaded_toolsets in agent state, and the next model turn receives the expanded schema set.
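
A minimal sketch of the local interception follows. The call shape (`id`, `arguments`), the `write_toolsets` state key, and the confirmation text are assumptions; only the `toolset` and `include_write_tools` arguments and the `loaded_toolsets` key come from the description above.

```python
from typing import Any, Dict, Tuple

VALID_TOOLSETS = {"k8s", "helm", "argo", "jenkins", "memory"}


def handle_load_toolset(state: Dict[str, Any], call: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Handle a load_toolset call in-process and return (new_state, tool_message)."""
    args = call["arguments"]
    toolset = args.get("toolset")
    if toolset not in VALID_TOOLSETS:
        new_state = state
        content = f"Unknown toolset: {toolset}"
    else:
        loaded = {**state.get("loaded_toolsets", {}), toolset: True}
        writes = set(state.get("write_toolsets", set()))  # hypothetical key
        if args.get("include_write_tools"):
            writes.add(toolset)
        new_state = {**state, "loaded_toolsets": loaded, "write_toolsets": writes}
        content = f"Loaded toolset: {toolset}"
    # The confirmation is returned to the model as an ordinary tool result.
    message = {"role": "tool", "tool_call_id": call["id"], "content": content}
    return new_state, message
```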

See Context Management for the full flow and examples.

Prompt Caching

Static content (system prompt and tool schemas) is eligible for provider-side caching.

OpenAI: Automatic caching for requests with 1,024+ prompt tokens.
Anthropic (native, Bedrock, Vertex AI): LiteLLM's cache_control_injection_points setting marks the system message and tool config as cacheable. Subsequent turns pay ~10% of the original input cost for cached blocks.

Cache hit rate is visible in the token.usage SSE event via prompt_tokens_details.cached_tokens.
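
As an illustration, a client could derive the hit rate from a `token.usage` payload like this. The surrounding payload shape (a `prompt_tokens` field alongside `prompt_tokens_details`) is an assumption based on the common OpenAI-style usage object.

```python
from typing import Any, Dict


def cache_hit_rate(usage: Dict[str, Any]) -> float:
    """Fraction of prompt tokens served from the provider-side cache."""
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


usage = {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1500}}
print(cache_hit_rate(usage))  # 0.75
```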

Conversation History Windowing

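Before each LLM call, messages are windowed. Without windowing, every call resends the full history, so cumulative token cost grows as O(N²) with conversation length. The window applies these rules: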

System messages: always preserved
First user message (intent anchor): always preserved
Last N messages: kept (default 40, configurable via LLM_CONTEXT_WINDOW_MESSAGES)
Orphaned tool messages (tool results without their preceding tool_calls): cleaned up
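
The windowing rules above can be sketched as a single function over OpenAI-style message dicts. This is an illustration under assumptions: the message shape (`role`, `tool_calls`, `tool_call_id`) and the choice to count the last N over all messages are not confirmed by this document.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def window_messages(messages: List[Message], max_messages: int = 40) -> List[Message]:
    first_user = next((i for i, m in enumerate(messages) if m["role"] == "user"), None)
    tail_start = max(0, len(messages) - max_messages)

    # Keep system messages, the first user message, and the last N messages.
    kept = [
        m for i, m in enumerate(messages)
        if m["role"] == "system" or i == first_user or i >= tail_start
    ]

    # Drop tool results whose originating assistant tool_calls were windowed out.
    known_call_ids = set()
    cleaned: List[Message] = []
    for m in kept:
        if m["role"] == "assistant":
            known_call_ids.update(tc["id"] for tc in m.get("tool_calls") or [])
        if m["role"] == "tool" and m.get("tool_call_id") not in known_call_ids:
            continue
        cleaned.append(m)
    return cleaned
```

Removing orphaned tool messages matters because chat-completion APIs generally reject a tool result that does not follow the assistant message that requested it.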

Memory Context Budget

Memory documents injected by memory_prepare are capped at MEMORY_CONTEXT_TOKEN_BUDGET (default 2,200 tokens). Retrieval over-fetches by 3× and trims to budget after re-ranking.
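
A minimal sketch of the over-fetch-then-trim step is shown below. The token estimate (about four characters per token) and the function names are assumptions; the Engine's actual tokenizer and trimming policy may differ.

```python
from typing import List


def fetch_limit(top_k: int, overfetch: int = 3) -> int:
    """How many candidates to request from search before re-ranking."""
    return top_k * overfetch


def estimate_tokens(text: str) -> int:
    """Rough heuristic, assumed for illustration: ~4 characters per token."""
    return max(1, len(text) // 4)


def fit_to_budget(ranked_docs: List[str], budget: int = 2200) -> List[str]:
    """Take documents in rank order until the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for doc in ranked_docs:
        cost = estimate_tokens(doc)
        if used + cost > budget:
            break  # stop at the first overflow so rank order is preserved
        selected.append(doc)
        used += cost
    return selected
```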

Memory System

The Engine integrates a persistent memory system that gives the agent durable operational context across sessions. See Memory for the full reference.

At the graph level:

memory_prepare runs on every new user turn before the model node.
Memory tool calls in gate bypass the MCP approval gate and go to MemoryVirtualToolExecutor.
Memory writes are safety-scanned and policy-checked before being persisted.
The memory.context.loaded SSE event surfaces what was injected, including trust levels.

SSE Streaming

All events stream from POST /api/v1/agent/chat. Event types:

Event	Source	Description
`ready`	SSE generator	Stream initialized with `run_id`
`ttft`	Model node	Time-to-first-token measurement
`thinking`	Model node	Reasoning token stream
`thinking.complete`	Model node	Full thinking content with duration
`token`	Model node	Content token stream
`token.usage`	Model node	Token counts and cost (includes `cached_tokens`)
`generation.start`	Model node	LLM call initiated
`generation.complete`	Model node	LLM call finished
`tools.pending`	Gate node	Tool calls identified, with `requires_approval` per tool
`tool.awaiting_approval`	Tool executor	Mutating tool paused for approval
`tool.approved`	Tool executor	Tool approved
`tool.denied`	Tool executor	Tool denied
`tool.executing`	Tool executor	Tool execution started
`tool.result`	Tool executor	Tool execution completed
`tool.error`	Tool executor	Tool execution failed
`rate_limit`	Model node	Rate limit hit, retrying with backoff
`transient_error`	Model node	Transient error, retrying
`memory.context.loaded`	Memory prepare	Documents injected; includes IDs, store slugs, paths, trust levels
`memory.search`	Memory executor	Agent called `memory_search`; includes query and result count
`memory.write.created`	Memory executor	Agent wrote a memory document
`memory.write.blocked`	Memory executor	Write blocked by safety scanner
`memory.policy.denied`	Memory executor	Write blocked by policy engine
`memory.promotion.proposed`	Memory executor	Agent proposed a draft for admin promotion to workspace
`memory.safety.flagged`	Memory executor	Safety scanner found findings during a write attempt
`completed`	Final node	Workflow completed with timing
`workflow.error`	Graph invoke	Unrecoverable error
`workflow_complete`	Endpoint	Terminal event for SSE cleanup
`heartbeat`	SSE generator	Keepalive (60s idle timeout)
`conversation.title.generated`	Endpoint	Auto-generated conversation title

Memory events are observable in the Command Center UI. They are not part of tools.pending display.
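
For clients other than the Command Center, the stream can be consumed with a small parser. The sketch assumes the standard Server-Sent Events framing (`event:` and `data:` fields separated by blank lines) with JSON payloads; the payload fields other than `run_id` are hypothetical.

```python
import json
from typing import Any, Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, Any]]:
    """Yield (event_type, payload) pairs from raw SSE lines."""
    event = "message"  # SSE default when no event field is present
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates one event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keepalive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)


stream = [
    "event: ready\n", 'data: {"run_id": "r1"}\n', "\n",
    ": keepalive\n",
    "event: token\n", 'data: {"text": "hi"}\n', "\n",
]
for event_type, payload in parse_sse(stream):
    print(event_type, payload)
```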

LLM Integration

The Engine uses LiteLLM as a model-agnostic abstraction layer.

Setting	Default	Purpose
`LLM_MODEL`	—	Model identifier (`openai/gpt-4o`, `anthropic/claude-sonnet-4-20250514`, etc.)
`LLM_HOST`	—	Optional custom API base URL
`LLM_MAX_ITERATIONS`	25	Recursion limit for graph execution
`LLM_CONTEXT_WINDOW_MESSAGES`	40	Max messages kept in context per LLM call
`LLM_THINKING_BUDGET_TOKENS`	—	Explicit thinking token budget
`LLM_REASONING_EFFORT`	`high`	Reasoning effort level (`low`, `medium`, `high`, `default`)
`LLM_MAX_TOKENS`	—	Max output tokens

Reasoning Support

The Engine auto-detects reasoning model support via litellm.supports_reasoning(). Configuration priority:

Explicit LLM_THINKING_BUDGET_TOKENS env var
Explicit LLM_REASONING_EFFORT env var
Auto-detection (defaults to reasoning_effort: "high")
Disabled (safe default for unknown models)
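
The priority order can be sketched as a small resolver. The returned keyword arguments are modelled on LiteLLM's `thinking` and `reasoning_effort` completion parameters, but the function name and the exact shapes are assumptions. The `supports_reasoning` flag would come from litellm.supports_reasoning().

```python
from typing import Any, Dict, Mapping


def resolve_reasoning_kwargs(env: Mapping[str, str], supports_reasoning: bool) -> Dict[str, Any]:
    """Return extra completion kwargs following the documented priority order."""
    budget = env.get("LLM_THINKING_BUDGET_TOKENS")
    if budget:
        # 1. An explicit token budget wins over everything else.
        return {"thinking": {"type": "enabled", "budget_tokens": int(budget)}}
    effort = env.get("LLM_REASONING_EFFORT")
    if effort:
        # 2. An explicit effort level.
        return {"reasoning_effort": effort}
    if supports_reasoning:
        # 3. Auto-detected reasoning model.
        return {"reasoning_effort": "high"}
    # 4. Unknown model: send no reasoning parameters.
    return {}
```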

Retry Logic

Error Type	Behavior
`RateLimitError`	Exponential backoff up to 60s, max 3 retries
Transient (timeout, 502/503/504)	Exponential backoff up to 30s, max 3 retries
Other	Immediate failure
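
The retry policy in the table can be sketched as follows. The exception classes are local stand-ins, and the 1-second base delay and doubling factor are assumptions; only the caps (60s and 30s) and the retry limit come from the table.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for the provider's rate limit error."""


class TransientError(Exception):
    """Stand-in for timeouts and 502/503/504 responses."""


def backoff_delay(attempt: int, cap: float, base: float = 1.0) -> float:
    """Exponential backoff: base * 2^attempt, never more than cap seconds."""
    return min(cap, base * (2 ** attempt))


def call_with_retry(fn: Callable[[], T],
                    sleep: Callable[[float], None] = time.sleep,
                    max_retries: int = 3) -> T:
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (RateLimitError, TransientError) as exc:
            if attempt == max_retries:
                raise  # retries exhausted
            cap = 60.0 if isinstance(exc, RateLimitError) else 30.0
            sleep(backoff_delay(attempt, cap))
        # Any other exception propagates immediately (no retry).
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real delays.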

Persistence

The create_event_callback function dual-writes to both Redis (for SSE) and ConversationPersistenceService (database). Persisted data includes:

Token usage and cost per turn
Time-to-first-token (TTFT) and time-to-response (TTR)
Thinking segments
Text segments
Tool execution segments (status: pending → executing → completed / error / denied)
Memory usage records (conversation_memory_usage — which docs were injected per run)
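
The dual-write pattern can be illustrated with a small callback factory. This is not the Engine's create_event_callback; `make_dual_write_callback`, `publish`, and `persist` are hypothetical names, with plain lists standing in for Redis and the database.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def make_dual_write_callback(publish: Callable[[Event], None],
                             persist: Callable[[Event], None]) -> Callable[[Event], None]:
    """Build one callback that fans each event out to both sinks."""
    def on_event(event: Event) -> None:
        publish(event)  # live path: feeds the SSE stream
        persist(event)  # durable path: conversation history
    return on_event


redis_channel: List[Event] = []   # stand-in for the Redis publish call
database_rows: List[Event] = []   # stand-in for ConversationPersistenceService
callback = make_dual_write_callback(redis_channel.append, database_rows.append)
callback({"type": "token.usage", "prompt_tokens": 1200})
```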

Authentication

The Engine authenticates with JWTs and rotates refresh tokens. Sessions are carried in HttpOnly cookies. The first user to register becomes the admin. Accessing a conversation requires being its owner or a superuser.

Cooperative Stop

Users can stop a running workflow via POST /api/v1/agent/stop, which sets a Redis flag. The Engine checks this flag every 25 streaming tokens and on each node entry. On detection, StopRequested propagates, the invoke() caller emits completed with status: "stopped", and the flag is cleared.
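
The periodic check during streaming can be sketched as a generator wrapper. `StopRequested` here is a local stand-in for the Engine's exception, and `stop_requested` is a hypothetical callable standing in for the Redis flag lookup.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class StopRequested(Exception):
    """Stand-in for the Engine's cooperative stop signal."""


def stream_with_stop_checks(tokens: Iterable[T],
                            stop_requested: Callable[[], bool],
                            check_every: int = 25) -> Iterator[T]:
    """Yield tokens, polling the stop flag once every `check_every` tokens."""
    for count, token in enumerate(tokens, start=1):
        if count % check_every == 0 and stop_requested():
            raise StopRequested()
        yield token
```

Polling every 25 tokens rather than on every token keeps the number of flag lookups small while still bounding how long a stop request can go unnoticed.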