Engine

The Engine is the core agent runtime. It runs a LangGraph workflow, orchestrates LLM calls via LiteLLM, injects memory context before each planning turn, executes MCP tools under an approval policy, and streams all events to the Command Center over SSE.

LangGraph Workflow

The graph has five nodes: entry → memory_prepare → model → gate → final, with conditional routing between them.

START --> [entry] --+--> [memory_prepare] --> [model] ---+--> [gate] --+--> [final] --> END
                    |                                    |             |
                    +-- if pending tools                 +-- if tool   +-- if awaiting
                        (skip memory_prepare) --> [gate]     calls         approval
                                                         |
                                                         +-- if no tool calls --> [final]

Node Responsibilities

entry — Initialization and reset.

  • Clears awaiting_approval if resuming from an approval pause.
  • Resets memory_context_loaded so retrieval runs fresh on each new user turn.
  • Checks for cooperative stop (StopRequested).
  • Routes to gate if pending_tools exist (approval resume path). Otherwise routes to memory_prepare.
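The routing decision at the end of entry can be sketched as a plain function of the kind LangGraph accepts for conditional edges. This is a minimal illustration, assuming agent state is a dict with a pending_tools list; the function name route_from_entry is hypothetical and not taken from the Engine source.

```python
from typing import Any, Dict


def route_from_entry(state: Dict[str, Any]) -> str:
    """Pick the node that follows entry (hypothetical helper)."""
    # Non-empty pending_tools means the run is resuming after an approval
    # pause, so retrieval is skipped and execution continues in gate.
    if state.get("pending_tools"):
        return "gate"
    # Otherwise this is a fresh user turn and memory retrieval runs first.
    return "memory_prepare"
```

The returned string is a node name, so the same function can be unit-tested without building the graph.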

memory_prepare — Context injection before the LLM turn.

  • Skipped if MEMORY_ENABLED=false or memory_context_loaded is already True (prevents double injection on resume paths).
  • Extracts the latest user message as the retrieval query.
  • Runs Postgres full-text search across the accessible stores (workspace, user, conversation), then re-ranks hits by trust score, document type boost, entity match, and recency decay.
  • Truncates results to MEMORY_CONTEXT_TOKEN_BUDGET (default 2,200 tokens).
  • Formats trusted hits as <skyflo_memory_context> and draft hits as <skyflo_untrusted_memory_candidates>. Appends both as a system message before the LLM sees the user turn.
  • Emits memory.context.loaded SSE event with document IDs, store slugs, paths, and trust levels.
  • On any error: logs a warning and continues to model without blocking. Memory failures never halt the workflow.
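The skip conditions and the fail-open behavior above can be sketched as follows. This is an illustration under stated assumptions, not the Engine's implementation: the retrieve callable stands in for the Postgres search and re-ranking step, and the memory_hits key, the memory_enabled parameter, and the logger name are hypothetical.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("engine.memory")  # hypothetical logger name


def latest_user_message(messages: List[Dict[str, Any]]) -> str:
    """Return the content of the most recent user message, or ''."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return str(message.get("content", ""))
    return ""


def memory_prepare(
    state: Dict[str, Any],
    retrieve: Callable[[str], List[Any]],
    memory_enabled: bool = True,
) -> Dict[str, Any]:
    """Return state updates; an empty dict means nothing was injected."""
    # Skip when memory is disabled or context was already loaded this turn.
    if not memory_enabled or state.get("memory_context_loaded"):
        return {}
    try:
        hits = retrieve(latest_user_message(state.get("messages", [])))
    except Exception as exc:
        # Fail open: log a warning and let the model node run without memory.
        logger.warning("memory retrieval failed, continuing: %s", exc)
        return {}
    return {"memory_hits": hits, "memory_context_loaded": True}
```

Catching a broad Exception is deliberate in this sketch, because the stated contract is that memory failures never halt the workflow.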

model — LLM inference via LiteLLM.

  • Loads tool schemas filtered by loaded_toolsets state. The virtual load_toolset tool and memory read tools are always included. Memory write tools require loaded_toolsets["memory"] == True.
  • Applies message windowing before each call. See Context Management.
  • Streams completion tokens, emitting token, thinking, ttft, token.usage, and generation.complete SSE events.
  • Uses native stop semantics: tool calls produced → route to gate; no tool calls → route to final.

gate — Tool execution with approval enforcement.

  • Intercepts load_toolset calls locally (no MCP round-trip), updates loaded_toolsets state, and returns a confirmation. The next model turn receives the expanded schema set.
  • Intercepts memory tool calls and dispatches to MemoryVirtualToolExecutor (bypasses MCP approval gate).
  • Iterates over remaining pending_tools, calling ToolExecutor.execute() for each.
  • On ApprovalPending: returns the tool results completed so far, keeps the remaining tools in pending_tools, sets awaiting_approval: True, and routes to final (graph pauses).
  • On success: clears pending_tools, appends tool result messages, routes to model.
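The pause-and-resume loop described above can be sketched like this. The ApprovalPending class below is a stand-in for the Engine's own exception, run_gate and demo_execute are hypothetical names, and the sketch assumes the tool that triggered the pause stays at the head of pending_tools so it is retried on resume.

```python
from typing import Any, Callable, Dict, List


class ApprovalPending(Exception):
    """Stand-in for the Engine's approval signal (real class not shown here)."""


def run_gate(
    pending_tools: List[Dict[str, Any]],
    execute: Callable[[Dict[str, Any]], Any],
) -> Dict[str, Any]:
    """Execute tools in order, pausing at the first one that needs approval."""
    results: List[Any] = []
    for index, call in enumerate(pending_tools):
        try:
            results.append(execute(call))
        except ApprovalPending:
            # Keep the paused tool and everything after it for the resume path.
            return {
                "tool_results": results,
                "pending_tools": pending_tools[index:],
                "awaiting_approval": True,
                "next": "final",
            }
    # Every tool ran: clear the queue and hand the results back to the model.
    return {
        "tool_results": results,
        "pending_tools": [],
        "awaiting_approval": False,
        "next": "model",
    }


def demo_execute(call: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical executor: treats delete_pod as a mutating tool."""
    if call["name"] == "delete_pod":
        raise ApprovalPending()
    return {"name": call["name"], "ok": True}
```

Calling run_gate with a read tool followed by delete_pod returns one result and leaves delete_pod queued with awaiting_approval set.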

final — Completion.

  • Calculates duration.
  • Emits completed SSE event with timing.
  • Sets done: True and routes to END.

Stop Condition

The loop uses native stop semantics, matching the pattern used by the Anthropic Claude SDK, the OpenAI Agents SDK, and the LangGraph reference implementations:

  • Tool calls produced → route to gate
  • No tool calls (text-only response) → route to final

No separate LLM call is made to determine whether to stop. This eliminates the 1–4 second latency that a judge call would add after every final response.
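As a rough sketch, the stop condition reduces to one check on the assistant message that the model node just produced. The function name route_after_model is hypothetical, and the message is assumed to be an OpenAI-style dict with an optional tool_calls list.

```python
from typing import Any, Dict


def route_after_model(assistant_message: Dict[str, Any]) -> str:
    """Route on the model's own output; no extra judge call is made."""
    # A non-empty tool_calls list means the model wants to act.
    if assistant_message.get("tool_calls"):
        return "gate"
    # A text-only response is treated as the final answer.
    return "final"
```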

Context Management

The Engine applies four optimizations to keep per-turn token consumption predictable.

On-Demand Toolset Loading

Instead of sending all tool schemas on every turn, the Engine exposes a virtual load_toolset tool that the LLM calls when it needs a specific toolset.

Default context: k8s read-only tools + memory read tools + load_toolset (~1,500 tokens of tool schemas).

Without optimization: all tool schemas would consume ~8,000 tokens every turn.

The load_toolset tool accepts two arguments: toolset (k8s, helm, argo, jenkins, or memory) and include_write_tools. The gate node intercepts the call locally with no MCP round-trip and updates loaded_toolsets in agent state, so the next model turn receives the expanded schema set.
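A minimal sketch of the local interception follows. The state shape is an assumption: loaded_toolsets is modelled as a dict of toolset name to True, and write access is tracked in a hypothetical write_enabled_toolsets set. The Engine's real state layout may differ.

```python
from typing import Any, Dict, Tuple

KNOWN_TOOLSETS = {"k8s", "helm", "argo", "jenkins", "memory"}


def intercept_load_toolset(
    state: Dict[str, Any], args: Dict[str, Any]
) -> Tuple[Dict[str, Any], str]:
    """Handle a load_toolset call locally and return (new_state, confirmation)."""
    toolset = args.get("toolset")
    if toolset not in KNOWN_TOOLSETS:
        # Unknown names leave state untouched; the model sees the error text.
        return state, f"Unknown toolset: {toolset!r}"
    loaded = dict(state.get("loaded_toolsets", {}))
    loaded[toolset] = True
    # write_enabled_toolsets is a hypothetical key used only in this sketch.
    write_enabled = set(state.get("write_enabled_toolsets", ()))
    if args.get("include_write_tools"):
        write_enabled.add(toolset)
    new_state = {
        **state,
        "loaded_toolsets": loaded,
        "write_enabled_toolsets": write_enabled,
    }
    return new_state, f"Loaded toolset {toolset!r}"
```

Returning a new state dict instead of mutating the input keeps the handler easy to test and matches how graph nodes usually return state updates.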

See Context Management for the full flow and examples.

Prompt Caching

Static content (system prompt and tool schemas) is eligible for provider-side caching.

  • OpenAI: Automatic caching for requests with 1,024+ prompt tokens.
  • Anthropic (native, Bedrock, Vertex AI): LiteLLM cache_control_injection_points marks the system message and tool config as cacheable. Subsequent turns pay ~10% of the original input cost for cached blocks.

Cache hit rate is visible in the token.usage SSE event via prompt_tokens_details.cached_tokens.
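One way to turn a token.usage payload into a hit rate is sketched below. It assumes the payload follows the OpenAI-style usage shape, with a top-level prompt_tokens count next to prompt_tokens_details.cached_tokens; the helper name cached_fraction is hypothetical.

```python
from typing import Any, Dict


def cached_fraction(usage: Dict[str, Any]) -> float:
    """Share of prompt tokens that were served from the provider cache."""
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens") or 0
    # Guard against division by zero when the usage payload is empty.
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0
```

For a payload with 4,000 prompt tokens of which 3,000 were cached, the function returns 0.75.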

Conversation History Windowing

Before each LLM call, messages are windowed to prevent O(N²) token cost growth:

  • System messages: always preserved
  • First user message (intent anchor): always preserved
  • Last N messages: kept (default 40, configurable via LLM_CONTEXT_WINDOW_MESSAGES)
  • Orphaned tool messages (tool results without their preceding tool_calls): cleaned up
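The four rules above can be sketched in a single pass that preserves message order. This is an illustration, assuming OpenAI-style message dicts where assistant messages carry tool_calls with an id and tool messages carry tool_call_id. The sketch also takes "last N" over the full message list, which is an assumption about how the Engine counts.

```python
from typing import Any, Dict, List


def window_messages(
    messages: List[Dict[str, Any]], max_recent: int = 40
) -> List[Dict[str, Any]]:
    """Keep system messages, the first user message, and the last max_recent."""
    start = max(0, len(messages) - max_recent)
    first_user = next(
        (i for i, m in enumerate(messages) if m.get("role") == "user"), None
    )
    seen_call_ids = set()
    windowed: List[Dict[str, Any]] = []
    for i, message in enumerate(messages):
        role = message.get("role")
        if not (role == "system" or i == first_user or i >= start):
            continue
        if role == "assistant":
            # Remember which tool calls survived the window.
            seen_call_ids.update(c["id"] for c in message.get("tool_calls") or [])
        elif role == "tool" and message.get("tool_call_id") not in seen_call_ids:
            # Orphaned tool result: its tool_calls message was windowed out.
            continue
        windowed.append(message)
    return windowed
```

Dropping orphaned tool results matters because most chat completion APIs reject a tool message that does not follow a matching tool_calls entry.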

Memory Context Budget

Memory documents injected by memory_prepare are capped at MEMORY_CONTEXT_TOKEN_BUDGET (default 2,200 tokens). Retrieval over-fetches by 3× and trims to budget after re-ranking.
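The over-fetch-then-trim step can be sketched as below. The token estimate (about four characters per token) is a rough stand-in for whatever tokenizer the Engine uses, and top_k, fetch_limit, and trim_to_budget are hypothetical names.

```python
from typing import Any, Dict, List

OVERFETCH_FACTOR = 3  # retrieval asks for 3x the hits it expects to keep


def approx_tokens(text: str) -> int:
    """Crude token estimate; a real tokenizer would replace this."""
    return max(1, len(text) // 4)


def fetch_limit(top_k: int) -> int:
    """Number of rows to request from search before re-ranking."""
    return top_k * OVERFETCH_FACTOR


def trim_to_budget(
    ranked_hits: List[Dict[str, Any]], budget_tokens: int = 2200
) -> List[Dict[str, Any]]:
    """Take re-ranked hits in order until the next one would exceed the budget."""
    selected: List[Dict[str, Any]] = []
    used = 0
    for hit in ranked_hits:
        cost = approx_tokens(hit["text"])
        if used + cost > budget_tokens:
            break
        selected.append(hit)
        used += cost
    return selected
```

Over-fetching gives the re-ranker more candidates to choose from, so the hits that survive trimming are the best-ranked ones and not simply the first ones returned by full-text search.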

Memory System

The Engine integrates a persistent memory system that gives the agent durable operational context across sessions. See Memory for the full reference.

At the graph level:

  • memory_prepare runs on every new user turn before the model node.
  • Memory tool calls in gate bypass the MCP approval gate and go to MemoryVirtualToolExecutor.
  • Memory writes are safety-scanned and policy-checked before being persisted.
  • The memory.context.loaded SSE event surfaces what was injected, including trust levels.

SSE Streaming

All events stream from POST /api/v1/agent/chat. Event types:

| Event                        | Source          | Description |
|------------------------------|-----------------|-------------|
| ready                        | SSE generator   | Stream initialized with run_id |
| ttft                         | Model node      | Time-to-first-token measurement |
| thinking                     | Model node      | Reasoning token stream |
| thinking.complete            | Model node      | Full thinking content with duration |
| token                        | Model node      | Content token stream |
| token.usage                  | Model node      | Token counts and cost (includes cached_tokens) |
| generation.start             | Model node      | LLM call initiated |
| generation.complete          | Model node      | LLM call finished |
| tools.pending                | Gate node       | Tool calls identified, with requires_approval per tool |
| tool.awaiting_approval       | Tool executor   | Mutating tool paused for approval |
| tool.approved                | Tool executor   | Tool approved |
| tool.denied                  | Tool executor   | Tool denied |
| tool.executing               | Tool executor   | Tool execution started |
| tool.result                  | Tool executor   | Tool execution completed |
| tool.error                   | Tool executor   | Tool execution failed |
| rate_limit                   | Model node      | Rate limit hit, retrying with backoff |
| transient_error              | Model node      | Transient error, retrying |
| memory.context.loaded        | Memory prepare  | Documents injected; includes IDs, store slugs, paths, trust levels |
| memory.search                | Memory executor | Agent called memory_search; includes query and result count |
| memory.write.created         | Memory executor | Agent wrote a memory document |
| memory.write.blocked         | Memory executor | Write blocked by safety scanner |
| memory.policy.denied         | Memory executor | Write blocked by policy engine |
| memory.promotion.proposed    | Memory executor | Agent proposed a draft for admin promotion to workspace |
| memory.safety.flagged        | Memory executor | Safety scanner found findings during a write attempt |
| completed                    | Final node      | Workflow completed with timing |
| workflow.error               | Graph invoke    | Unrecoverable error |
| workflow_complete            | Endpoint        | Terminal event for SSE cleanup |
| heartbeat                    | SSE generator   | Keepalive (60s idle timeout) |
| conversation.title.generated | Endpoint        | Auto-generated conversation title |

Memory events are observable in the Command Center UI. They are not part of tools.pending display.
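For clients other than the Command Center, a small parser is enough to consume the stream. The sketch below assumes the endpoint uses standard SSE framing (event: and data: lines separated by a blank line) with JSON payloads. That wire format is not confirmed by this document, and the text field in the sample token payload is hypothetical.

```python
import json
from typing import Any, List, Tuple


def parse_sse(raw: str) -> List[Tuple[str, Any]]:
    """Split a raw SSE body into (event_name, decoded_json_payload) pairs."""
    events: List[Tuple[str, Any]] = []
    for frame in raw.replace("\r\n", "\n").split("\n\n"):
        name = "message"  # SSE default when no event: line is present
        data_lines: List[str] = []
        for line in frame.split("\n"):
            if not line or line.startswith(":"):
                continue  # blank line or comment (often used as keepalive)
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # the spec strips one leading space
            if field == "event":
                name = value
            elif field == "data":
                data_lines.append(value)
        if data_lines:
            events.append((name, json.loads("\n".join(data_lines))))
    return events


SAMPLE = (
    'event: ready\ndata: {"run_id": "r1"}\n\n'
    ": keepalive\n\n"
    'event: token\ndata: {"text": "hi"}\n\n'
)
```

Parsing SAMPLE yields a ready event carrying run_id followed by a token event; the comment frame is skipped.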

LLM Integration

The Engine uses LiteLLM as a model-agnostic abstraction layer.

| Setting                     | Default | Purpose |
|-----------------------------|---------|---------|
| LLM_MODEL                   |         | Model identifier (openai/gpt-4o, anthropic/claude-sonnet-4-20250514, etc.) |
| LLM_HOST                    |         | Optional custom API base URL |
| LLM_MAX_ITERATIONS          | 25      | Recursion limit for graph execution |
| LLM_CONTEXT_WINDOW_MESSAGES | 40      | Max messages kept in context per LLM call |
| LLM_THINKING_BUDGET_TOKENS  |         | Explicit thinking token budget |
| LLM_REASONING_EFFORT        | high    | Reasoning effort level (low, medium, high, default) |
| LLM_MAX_TOKENS              |         | Max output tokens |

Reasoning Support

The Engine auto-detects reasoning model support via litellm.supports_reasoning(). Configuration priority:

  1. Explicit LLM_THINKING_BUDGET_TOKENS env var
  2. Explicit LLM_REASONING_EFFORT env var
  3. Auto-detection (defaults to reasoning_effort: "high")
  4. Disabled (safe default for unknown models)
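The priority order above can be sketched as a small resolver. It follows the list literally; the return shape and the name resolve_reasoning are hypothetical, and supports_reasoning is assumed to be the boolean result of the litellm.supports_reasoning() check mentioned above.

```python
from typing import Any, Dict, Mapping


def resolve_reasoning(
    env: Mapping[str, str], supports_reasoning: bool
) -> Dict[str, Any]:
    """Resolve reasoning settings using the documented priority order."""
    budget = env.get("LLM_THINKING_BUDGET_TOKENS")
    if budget:
        # 1. An explicit thinking budget wins over everything else.
        return {"source": "thinking_budget", "budget_tokens": int(budget)}
    effort = env.get("LLM_REASONING_EFFORT")
    if effort:
        # 2. An explicit effort level comes next.
        return {"source": "reasoning_effort", "reasoning_effort": effort}
    if supports_reasoning:
        # 3. Auto-detected reasoning models default to high effort.
        return {"source": "auto", "reasoning_effort": "high"}
    # 4. Unknown models run with reasoning disabled.
    return {"source": "disabled"}
```

Passing the environment as a mapping keeps the resolver pure, so each priority level can be tested without touching os.environ.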

Retry Logic

| Error Type                       | Behavior |
|----------------------------------|----------|
| RateLimitError                   | Exponential backoff up to 60s, max 3 retries |
| Transient (timeout, 502/503/504) | Exponential backoff up to 30s, max 3 retries |
| Other                            | Immediate failure |
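The retry schedule can be sketched as a capped exponential series. The caps and retry counts come from the table above; the 1-second base delay, the doubling factor, and the names POLICIES and delays_for are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# error kind -> (maximum delay in seconds, maximum retries), from the table above
POLICIES: Dict[str, Tuple[float, int]] = {
    "rate_limit": (60.0, 3),
    "transient": (30.0, 3),
}


def delays_for(kind: str, base: float = 1.0) -> List[float]:
    """Return the sleep duration before each retry; empty means fail immediately."""
    if kind not in POLICIES:
        return []  # any other error type is not retried
    cap, max_retries = POLICIES[kind]
    # Double the delay on each attempt, never exceeding the cap.
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]
```

With the assumed 1-second base, both retryable kinds wait 1, 2, and 4 seconds; the caps only take effect with a larger base delay.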

Persistence

The create_event_callback function dual-writes to both Redis (for SSE) and ConversationPersistenceService (database). Persisted data includes:

  • Token usage and cost per turn
  • Time-to-first-token (TTFT) and time-to-response (TTR)
  • Thinking segments
  • Text segments
  • Tool execution segments (status: pending → executing → completed / error / denied)
  • Memory usage records (conversation_memory_usage — which docs were injected per run)
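The dual-write pattern can be sketched as a callback factory. This is not the Engine's create_event_callback: make_dual_write_callback is a hypothetical name, and the publish and persist callables stand in for the Redis publisher and ConversationPersistenceService.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]


def make_dual_write_callback(
    publish: Callable[[Event], None],
    persist: Callable[[Event], None],
) -> Callable[[Event], None]:
    """Build a callback that sends every event to both sinks."""

    def callback(event: Event) -> None:
        publish(event)  # live path: feeds the SSE stream
        persist(event)  # durable path: stored with the conversation

    return callback
```

Injecting the two sinks as callables lets tests pass plain list.append functions and compare what each side received.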

Authentication

Authentication uses JWTs with refresh token rotation, and sessions are carried in HttpOnly cookies. The first user to register becomes admin. Conversation access requires an owner match or superuser status.

Cooperative Stop

Users can stop a running workflow via POST /api/v1/agent/stop, which sets a Redis flag. The Engine checks this flag every 25 streaming tokens and on each node entry. On detection, StopRequested propagates, the invoke() caller emits completed with status: "stopped", and the flag is cleared.
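The token-interval check can be sketched as a generator wrapper. The StopRequested class below is a stand-in for the Engine's exception, stop_flag_set stands in for the Redis lookup, and stream_with_stop_checks is a hypothetical name.

```python
from typing import Callable, Iterable, Iterator


class StopRequested(Exception):
    """Stand-in for the Engine's cooperative stop signal."""


def stream_with_stop_checks(
    tokens: Iterable[str],
    stop_flag_set: Callable[[], bool],
    check_every: int = 25,
) -> Iterator[str]:
    """Yield tokens, polling the stop flag once every check_every tokens."""
    for count, token in enumerate(tokens, start=1):
        # Polling at intervals keeps flag lookups off the per-token hot path.
        if count % check_every == 0 and stop_flag_set():
            raise StopRequested()
        yield token


def collect_until_stopped(tokens: Iterable[str], stop_flag_set: Callable[[], bool]):
    """Usage helper: return (tokens_received, was_stopped)."""
    received = []
    try:
        for token in stream_with_stop_checks(tokens, stop_flag_set):
            received.append(token)
    except StopRequested:
        return received, True
    return received, False
```

In this sketch a stop request takes effect at the next multiple of 25 tokens, so the caller receives at most 24 further tokens after the flag is set.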