AI Operations Agent

AI Agent for Kubernetes.
Approval-Gated by Design.

Self-hosted AI agent for Kubernetes and CI/CD. Converts intent into typed, approval-gated infrastructure operations inside your cluster.

Open Source
Self-Hosted
No Telemetry
Multi-LLM
The Post-Code Bottleneck

Code Ships Faster Than Ever.
Your Operations Haven't Kept Up.

AI coding tools accelerated development. But deploying, operating, and keeping production alive? That's still manual, fragmented, and risky.

01

Brittle Scripts & Manual kubectl

Your deployment process is held together by tribal knowledge and shell scripts that break at 2 AM.

Tribal knowledge
02

Fragmented Visibility

Prometheus in one tab, Grafana in another, kubectl in the terminal, Slack on fire. Four tools, zero unified context when it matters most.

Zero unified context
03

Every Mutation Is a Blind Risk

kubectl apply with fingers crossed. No dry-run, no rollback plan, no verification that the change did what you intended.

No safety net
Deterministic Control Loop

Not a Black Box.
A Deterministic Control Loop.

Every infrastructure change follows four auditable steps. The agent plans, you approve, typed tools execute, and the outcome is verified.

01

Plan

Memory context loads automatically before each planning turn: prior incidents, runbooks, and cluster conventions. The agent then analyzes your intent and generates a concrete, evidence-grounded action plan.

Memory-informed planning
02

Approve

Every mutating tool call pauses for explicit human approval. Read operations flow freely. The gate is enforced by the engine, not a UI toggle you can turn off.

Human gate on mutations
03

Execute

Typed tools run via MCP. Schema-validated inputs. Sandboxed execution. Full audit trail for every operation — who asked, what was planned, who approved, what ran.

Typed, sandboxed execution
04

Verify

The agent validates the outcome against your original intent. If cluster state drifts from the plan, it flags the discrepancy and suggests remediation. Findings can be persisted to memory for future incidents.

Post-action verification
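
As a rough illustration of the Verify step, the check can be thought of as a field-by-field comparison between what the plan intended and what the cluster reports afterwards. The sketch below is not Skyflo's implementation; the function name `find_drift` and the flat dictionary shape are assumptions made for the example.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel so a literal None in the cluster state is not confused with "absent"


def find_drift(intended: Dict[str, Any], observed: Dict[str, Any]) -> List[str]:
    """Return one finding per intended field that the observed state does not match."""
    findings: List[str] = []
    for key, want in intended.items():
        got = observed.get(key, _MISSING)
        if got is _MISSING:
            findings.append(f"{key}: expected {want!r}, but the field is missing")
        elif got != want:
            findings.append(f"{key}: expected {want!r}, observed {got!r}")
    return findings
```

An empty list means the observed state matches the plan. Any entries are the discrepancies a verification step would flag for the operator.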

The approval gate is enforced at the engine level. Not a setting. Not configurable off.

Persistent Operational Context

The Agent Learns Your Infrastructure.
Context That Outlasts the Incident.

Prior incidents, runbooks, and cluster conventions are retrieved from Postgres before each planning turn. No context loss between sessions.

Runbooks and Incident History

Prior incidents, remediation playbooks, and cluster conventions are retrieved automatically before each diagnosis. The agent starts informed, not from scratch.

Trust-Ranked Retrieval

Admin-approved workspace docs outrank user notes. Agent drafts are surfaced as candidates, not ground truth. Context quality is controlled, not just aggregated.
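
One way to picture trust-ranked retrieval is a sort that puts trust tier ahead of relevance. The sketch below assumes three tiers matching the memory types on this page and a hypothetical `relevance` score from a search step. The names `Trust`, `MemoryItem`, `rank`, and `is_candidate` are illustrative, not Skyflo's API.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Trust(IntEnum):
    CONVERSATION = 1  # agent drafts from the current session
    USER = 2          # notes owned by a single operator
    WORKSPACE = 3     # admin-approved runbooks and conventions


@dataclass(frozen=True)
class MemoryItem:
    text: str
    trust: Trust
    relevance: float  # assumed 0..1 score from a hypothetical search step


def rank(items: List[MemoryItem]) -> List[MemoryItem]:
    """Order by trust tier first, then by relevance within a tier."""
    return sorted(items, key=lambda m: (m.trust, m.relevance), reverse=True)


def is_candidate(item: MemoryItem) -> bool:
    """Agent drafts are surfaced as candidates rather than ground truth."""
    return item.trust is Trust.CONVERSATION
```

With this ordering, a highly relevant agent draft still sorts below a less relevant workspace runbook, which is the behavior the description above implies.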

Safety-Scanned Writes

The safety scanner blocks credentials, API tokens, private keys, and high-entropy strings from being persisted. Sensitive data never reaches the memory store.
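
A scanner of this kind is commonly built from pattern checks plus an entropy check on long tokens. The following is a minimal sketch under that assumption. The patterns, the 20-character minimum, and the 4.0 bits-per-character threshold are illustrative choices, not Skyflo's actual rules.

```python
import math
import re
from collections import Counter
from typing import List

# Illustrative patterns only; a real scanner would cover many more formats.
PATTERNS = {
    "private key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "credential assignment": re.compile(
        r"(?i)\b(password|passwd|secret|token|api[_-]?key)\s*[:=]\s*\S+"
    ),
}
LONG_TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")


def shannon_entropy(s: str) -> float:
    """Bits per character of the empirical character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def find_violations(text: str, threshold: float = 4.0) -> List[str]:
    """Return the reasons a memory write should be blocked (empty means allowed)."""
    reasons = [label for label, pattern in PATTERNS.items() if pattern.search(text)]
    if any(shannon_entropy(tok) >= threshold for tok in LONG_TOKEN.findall(text)):
        reasons.append("high-entropy string")
    return reasons
```

A write would be persisted only when `find_violations` returns an empty list. Entropy thresholds trade false positives against misses, so the value here should be read as a placeholder.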

Advisory by Design

Memory provides starting hypotheses. The agent always verifies live cluster state with tools before acting on memory. Prior context never bypasses tool confirmation.

Workspace Memory
highest trust

Admin-approved runbooks and conventions shared across the team.

User Memory
owner only

Personal preferences and operational shortcuts owned by each operator.

Conversation Memory
draft

Ephemeral agent drafts from the current session. Can be proposed for workspace promotion.

Memory is advisory. Live tool verification always takes precedence over what memory says.

See It in Action

Real Operations.
Against a Live Cluster.

Not mockups or scripted demos. Watch Skyflo handle the workflows your team runs every day.

Faster Diagnosis. Safer Changes.
Auditable Operations.

Architecture is table stakes. These are the operational outcomes that matter to your team.

Faster Incident Diagnosis

The agent correlates logs, events, and resource state in a single pass. No more context-switching across dashboards.

Consistent, Auditable Deployments

No more ad-hoc kubectl runs or untracked mutations. Every change is repeatable and auditable.

Approval Gates on Writes, Not Reads

Read operations flow freely. Mutating tool calls require explicit approval. Your developers move fast. Your infrastructure stays safe.
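
The split between reads and writes can be sketched as a check that sits in the engine's call path rather than in the UI. The example below is a simplified illustration, not Skyflo's engine. The `Engine`, `Tool`, and `ApprovalRequired` names and the two demo tools are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


class ApprovalRequired(Exception):
    """Raised when a mutating tool call arrives without an approver."""


@dataclass(frozen=True)
class Tool:
    name: str
    mutating: bool
    run: Callable[[dict], str]


@dataclass
class Engine:
    tools: Dict[str, Tool]
    audit_log: List[dict] = field(default_factory=list)

    def call(self, name: str, args: dict, approved_by: Optional[str] = None) -> str:
        tool = self.tools[name]
        # The gate lives in the call path: there is no flag that skips it.
        if tool.mutating and approved_by is None:
            self.audit_log.append({"tool": name, "status": "pending_approval"})
            raise ApprovalRequired(f"{name} mutates state and needs approval")
        result = tool.run(args)
        self.audit_log.append(
            {"tool": name, "status": "executed", "approved_by": approved_by}
        )
        return result


def demo_engine() -> Engine:
    """Build an engine with one read tool and one mutating tool (both hypothetical)."""
    return Engine(tools={
        "get_pods": Tool("get_pods", False, lambda a: f"listed pods in {a['namespace']}"),
        "scale_deployment": Tool(
            "scale_deployment", True, lambda a: f"scaled {a['name']} to {a['replicas']}"
        ),
    })
```

Calling `get_pods` returns immediately, while `scale_deployment` raises `ApprovalRequired` until a caller supplies `approved_by`. Every attempt, blocked or executed, leaves an entry in the audit log.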

Used in production at

Storylane
Get Started

Your Cluster. Your Agent.
Running in Minutes.

Deploy on your cluster with your own LLM. No Skyflo telemetry or phone-home.

Apache 2.0
Self-Hosted
Approval-Gated
Skyflo for Teams

Ready for Your Team?
Scale with Confidence

Skyflo for Teams adds collaboration, governance, and integrations. Same agent. Same control loop. Same approval gates.

Chat Integration

Operate from Slack, Microsoft Teams, and more

SCM Integration

Persist changes to GitHub, GitLab, Bitbucket

AI Alerting Agent

Anomaly correlation and proactive detection

RBAC & Governance

Team permissions, audit trails, SSO

RBAC & Team Management
Audit Trails
Self-Hosted
SSO Compatible
Core Capabilities

An Execution Runtime.
Not a Chat Wrapper.

Every capability maps to an operational outcome.

01

Natural Language to Typed Execution

Describe what you need in plain English. Skyflo converts intent into schema-validated tool calls, not shell-injected strings.

Intent → Execution
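
To make the contrast with shell-injected strings concrete, here is a small sketch of a typed tool input that validates its fields and then produces an argument list instead of a shell string. The `ScaleDeployment` class is hypothetical, and the name pattern is a simplified DNS-label check rather than the full Kubernetes naming rules.

```python
import re
from dataclasses import dataclass
from typing import List

# Simplified DNS-label pattern; real Kubernetes name rules vary by resource type.
_NAME = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")


@dataclass(frozen=True)
class ScaleDeployment:
    namespace: str
    name: str
    replicas: int

    def __post_init__(self) -> None:
        for label, value in (("namespace", self.namespace), ("name", self.name)):
            if not isinstance(value, str) or not _NAME.fullmatch(value):
                raise ValueError(f"invalid {label}: {value!r}")
        if isinstance(self.replicas, bool) or not isinstance(self.replicas, int):
            raise ValueError("replicas must be an integer")
        if not 0 <= self.replicas <= 100:
            raise ValueError("replicas out of range (0-100 in this sketch)")

    def to_argv(self) -> List[str]:
        """Build an argument vector; nothing here is ever parsed by a shell."""
        return [
            "kubectl", "scale", f"deployment/{self.name}",
            f"--namespace={self.namespace}", f"--replicas={self.replicas}",
        ]
```

An input such as `name="api; rm -rf /"` fails validation before any command is built, and a valid input yields a list that can be passed to a process runner without shell interpretation.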
02

Unified Cluster Context

Logs, events, resource state, and configuration correlated in one pass. Diagnose a CrashLoopBackOff without switching between five terminals.

One interface
03

Graph-Based Workflow Engine

A LangGraph-powered workflow with distinct phases: planning, approval gate, execution, verification. Deterministic. Replayable. Not a monolithic LLM call.

Deterministic flow
04

Live Agent Reasoning

Agent thoughts, tool progress, memory retrievals, and results streamed in real time via SSE. Full visibility into every decision.

Real-time streaming
05

Post-Action Verification

The agent validates outcomes against your original intent. Drifts are flagged. Verified findings can be saved to memory for future incidents.

Outcome validation
06

Extensible via MCP

Every tool follows the Model Context Protocol. Typed inputs, sandboxed execution, defined safety model per tool. Community contributions welcome.

Open standard

Persistent Memory Context

New

Prior incidents, runbooks, and cluster conventions are retrieved from Postgres before each planning turn. The agent starts informed. Writes are policy-gated and safety-scanned. Memory is advisory — live tool verification always takes precedence.

memory_search
memory_remember
memory_propose_promotion

Every capability ships in the open-source release.

MCP Tool Ecosystem

Deep Coverage.
Efficient Context by Design.

Each toolset is loaded on-demand, not all at once. The agent starts lean and requests only the schemas it needs for your specific query.

Kubernetes

Orchestration
default context

Discovery, logs, exec, apply, diff

  • Discover resources across namespaces
  • Stream pod logs and exec into containers
  • Drain and cordon nodes safely
  • Preview changes with diff before apply
Read: Auto
Write: Human Approval

Helm

Package Management
load on-demand

Search, install, upgrade, rollback

  • Install charts with custom values
  • Upgrade releases with dry-run preview
  • Roll back to any previous revision
  • Manage chart repositories
Read: Auto
Write: Dry-run + Diff

Argo Rollouts

Progressive Delivery
load on-demand

Pause, resume, promote, abort

  • Run canary and blue-green deployments
  • Promote or abort with human gate
  • Monitor analysis runs and experiments
  • Track full rollout history and status
Read: Auto
Write: Human Gate

Jenkins

CI/CD
load on-demand

Jobs, builds, logs, SCM, identity

  • Manage and trigger build jobs
  • Stream build logs in real time
  • Inspect SCM configurations
  • Authenticate via Kubernetes Secrets
Read: Auto
Write: Secure Auth + CSRF
On-Demand Toolset Loading

The agent starts with Kubernetes read-only schemas in context. It calls load_toolset to add Helm, Argo, or Jenkins only when your query needs them. Deep coverage without bloating every turn with unused schemas.
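
The loading behavior described above can be sketched as a context object that starts with one toolset and grows only when asked. Only the `load_toolset` name comes from this page. The `ToolContext` class, the registry, and the individual tool names below are hypothetical placeholders.

```python
from typing import Dict, List

# Hypothetical registry: toolset name -> tool schema names.
TOOLSETS: Dict[str, List[str]] = {
    "kubernetes": ["k8s_get", "k8s_logs", "k8s_describe"],
    "helm": ["helm_list", "helm_upgrade", "helm_rollback"],
    "argo": ["argo_status", "argo_promote", "argo_abort"],
    "jenkins": ["jenkins_list_jobs", "jenkins_build_log"],
}


class ToolContext:
    def __init__(self) -> None:
        # Start lean: only the default toolset is in context.
        self.loaded: List[str] = ["kubernetes"]

    def load_toolset(self, name: str) -> List[str]:
        """Add a toolset's schemas to the context; loading twice is a no-op."""
        if name not in TOOLSETS:
            raise KeyError(f"unknown toolset: {name}")
        if name not in self.loaded:
            self.loaded.append(name)
        return TOOLSETS[name]

    def schemas_in_context(self) -> List[str]:
        """Schemas the model would see on the next turn."""
        return [tool for ts in self.loaded for tool in TOOLSETS[ts]]
```

In this sketch a query that never mentions Helm never pays for the Helm schemas, which is the context-size benefit the paragraph above describes.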

On the Roadmap

AWS
GCP
Azure
GitHub Actions
GitLab CI

Same typed, sandboxed pattern. All open source.

Open Source. Self-Hosted.

An AI Agent in Your Cluster
Should Be Yours to Audit.

Apache 2.0 licensed. The agent, the control loop, and the safety model are all inspectable and under your control.

01

Full Source Transparency

Every tool call, every decision path, every safety check is in the source.

Apache 2.0 licensed
02

Self-Hosted, In-Cluster

Runs inside your Kubernetes cluster. LLM calls go only to the provider you configure.

Your data stays yours
03

Bring Your Own LLM, No Lock-in

OpenAI, Anthropic, Gemini, Groq, or self-hosted models. Switch providers without changing workflows.

Bring your own LLM
04

Safety Is Not a Premium Feature

Approval gates ship in the open-source release. No feature gates on safety. No usage limits.

Free forever

No black-box agent decisions. No Skyflo telemetry.

Built in the Open

Transparent, auditable, and built for operators managing production Kubernetes.

Open Source

Full source code available under the Apache 2.0 license. Audit every line. No black boxes in your production stack.

Join Our Channels

Connect with operators and developers building on Skyflo.

Install and Run Your First Operation

Install Skyflo on your cluster and run your first operation today.

terminal
$ curl -fsSL https://skyflo.ai/install.sh | bash