The Automation Spectrum
DevOps automation isn't new. What's new is the third option on the spectrum:
| Approach | How It Works | Strength | Weakness |
|---|---|---|---|
| Script Automation | Hardcoded bash/Python scripts triggered by events or cron | Predictable, testable, version-controlled | Brittle, no error recovery, can't handle novel situations |
| Raw AI | LLM generates and executes commands from natural language | Flexible, handles novel requests, natural language interface | Unpredictable, hallucinates, no safety boundaries |
| Agentic AI | Structured agents with typed tools, approval gates, and verification | Adaptive + safe, handles novel situations within bounded execution | Requires architecture investment, depends on tool ecosystem |
Most teams are stuck in the first row. Some have experimented with the second. The third is where the industry is heading, and the design decisions that separate agentic AI from "AI that runs scripts" are what determine whether it's production-grade or demo-grade.
Why Scripts Break: The Brittleness Problem
Every DevOps team has a scripts/ directory. It contains things like:
```bash
#!/bin/bash
# restart-stuck-pods.sh
# "Works on my cluster" — written at 3 AM during an incident, never refactored

NAMESPACE=${1:-default}
PODS=$(kubectl get pods -n $NAMESPACE --field-selector=status.phase=Failed -o name)

if [ -z "$PODS" ]; then
  echo "No failed pods found"
  exit 0
fi

for POD in $PODS; do
  echo "Deleting $POD"
  kubectl delete $POD -n $NAMESPACE --grace-period=30
done

echo "Done. Deleted $(echo $PODS | wc -w) pods."
```
This script works until it doesn't. Here's the failure taxonomy:
Hardcoded assumptions. The script assumes failed pods should be deleted. But what if the pods are failed because of a node issue? Deleting them won't help. They'll just reschedule to the same broken node. The script can't distinguish between "pod has a bug" and "node is unhealthy."
No context awareness. The script doesn't check what else is running in the namespace. If payment-service has 2 of 3 replicas failed and you delete them, you've just reduced capacity to 33% during an incident. A human would check replica counts first. The script doesn't.
Brittle error handling. If kubectl delete fails (API server timeout, RBAC issue, resource already deleted), the script either continues blindly or exits entirely. There's no retry logic, no partial failure handling, no way to resume.
No verification. The script reports "Done. Deleted 5 pods." But it doesn't check whether new pods came up healthy. It doesn't verify that the original problem is resolved. "Deleted successfully" and "system is healthy" are not the same thing.
Maintenance burden. Over time, the scripts directory accumulates dozens of one-off scripts, each with slightly different conventions, different error handling (or none), and different assumptions about cluster state. New team members can't trust them. Experienced team members can't remember which ones are current.
The fundamental problem: scripts encode a specific solution to a specific problem. When the problem changes (different namespace layout, different resource types, different failure mode), the script breaks or does the wrong thing.
Why Raw AI Is Dangerous: The Hallucination Problem
The appeal of replacing scripts with AI is obvious: instead of maintaining 50 brittle scripts, you tell the AI what you need in natural language. No hardcoded paths. No brittle assumptions. The AI adapts.
The problem is that an LLM's adaptability is also its danger. It will confidently generate commands that look correct but are subtly wrong:
Namespace confusion.
You: "Delete the old test pods"
AI: kubectl delete pods --all -n productionThe AI interpreted "test pods" as "pods to be tested" in the production namespace, not pods in the test namespace. This is a realistic hallucination. LLMs optimize for helpfulness, not caution.
Flag hallucination.
You: "Force restart the stuck deployment"
AI: kubectl rollout restart deployment/api-gateway --force --cascade=true--force and --cascade=true are not valid flags for kubectl rollout restart. But the LLM has seen similar patterns in training data and confidently generates them. If you're piping this to a shell, the command fails. Or worse, a future kubectl version adds those flags with different semantics.
Scope creep.
You: "Clean up resources from last week's load test"
AI: kubectl delete namespace load-test-20250127
kubectl delete pvc -l app=load-test --all-namespaces
kubectl delete clusterrole load-test-runner
kubectl delete clusterrolebinding load-test-runner-bindingYou asked for cleanup. The AI helpfully expanded scope to include PVCs across all namespaces and cluster-level RBAC resources. Each individual command might be correct, but the aggregate action is far more aggressive than what you intended.
The core issue: An LLM generating shell commands has unbounded output. It can generate any command, including ones that are destructive, irrelevant, or hallucinated. There's no structural constraint on what it can produce, only statistical probability that the output will be correct.
The Agentic Middle Ground
Agentic AI combines the adaptability of AI with the safety boundaries of structured automation. The key architectural components:
1. Typed Tool Execution Instead of Shell Commands
Instead of generating arbitrary kubectl commands, an agentic system calls typed tools with validated parameters:
| Approach | What the AI produces | What gets executed |
|---|---|---|
| Raw AI | kubectl delete pods --all -n production | Exactly that string, in a shell |
| Agentic AI | kubernetes.delete_pod(name="test-runner-7d4f8", namespace="load-test") | A validated function call with schema checking |
The typed tool call enforces several constraints:
- Parameter validation: `namespace` must match an existing namespace. `name` must match an existing resource. Invalid parameters are rejected before execution.
- Scope limitation: The `delete_pod` tool deletes one pod. There is no `delete_all_pods_everywhere` tool. The blast radius is bounded by the tool's definition, not the LLM's creativity.
- Auditability: The structured tool call is logged with all parameters, the approver, and the result. Raw shell commands in a subprocess are typically not logged at this level of detail.
This is what MCP (Model Context Protocol) provides in Skyflo's architecture. Each tool is defined with a JSON Schema for inputs, a permission tag (read or write), and a structured output format. The LLM selects tools and fills parameters; it doesn't generate arbitrary commands.
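To make this concrete, here is a minimal sketch of what a typed tool definition in this spirit might look like. The field names, the schema layout, and the tool name are illustrative assumptions for this article, not Skyflo's actual MCP server code.

```python
# Hypothetical sketch of a typed tool definition in the spirit of MCP.
# All names and fields here are assumptions, not Skyflo's real schema.
DELETE_POD_TOOL = {
    "name": "kubernetes.delete_pod",
    "permission": "write",          # write tools require human approval
    "input_schema": {               # JSON Schema validated before execution
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Exact pod name"},
            "namespace": {"type": "string", "description": "Existing namespace"},
            "grace_period_seconds": {"type": "integer", "minimum": 0, "default": 30},
        },
        "required": ["name", "namespace"],
        "additionalProperties": False,  # no extra flags the LLM can invent
    },
}
```

The LLM can only fill the fields the schema exposes; there is no parameter it can set that would turn this into a delete-everything-everywhere operation.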
2. Read/Write Boundary
Agentic systems distinguish between operations that observe and operations that change:
| Operation Type | Examples | Gate |
|---|---|---|
| Read | List pods, get logs, describe deployment, query Prometheus | Execute freely, no approval needed |
| Write | Scale deployment, apply manifest, delete pod, rollback Helm release | Requires human approval before execution |
This boundary is critical because most diagnostic work is read-only. When an agent investigates a latency spike, it needs to list pods, check events, read logs, and query metrics, all without asking for approval at every step. The approval gate fires only when the agent proposes a change.
Scripts don't have this concept. A script that "fixes" a problem typically mixes reads and writes in a single flow. An agentic system separates diagnosis (reads) from remediation (writes), giving the human control over the consequential step.
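A minimal sketch of how that boundary can be enforced in code, assuming each tool dict carries the permission tag shown earlier; run_tool and the approval callback are placeholders, not Skyflo's actual interfaces.

```python
def run_tool(tool, params):
    """Placeholder for the real MCP tool invocation."""
    return {"status": "executed", "tool": tool["name"], "params": params}

def execute_tool(tool, params, ask_for_approval):
    if tool["permission"] == "read":
        # Read-only diagnostics run freely, no approval gate.
        return run_tool(tool, params)
    # Every write stops here until a human approves this specific call.
    if not ask_for_approval(tool["name"], params):
        return {"status": "rejected", "tool": tool["name"]}
    return run_tool(tool, params)
```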
3. Planning Before Execution
Scripts execute immediately. Raw AI generates and executes in one step. Agentic AI plans first.
The planning phase serves two purposes:
It grounds the AI in reality. Instead of generating a fix from training data, the agent queries your actual cluster state. It knows how many replicas you have, what resource limits are set, which version is deployed. The proposed action is based on evidence, not pattern matching.
It creates a review point. The plan is presented to the human before any mutation occurs. You review the proposed action, the target resources, and the expected outcome. If the plan is wrong (wrong namespace, wrong resource, wrong approach), you catch it before it touches your cluster.
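As a rough illustration of what a reviewable plan could look like as data, here is a minimal sketch; the class and field names are assumptions for this example, not Skyflo's actual planner output.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlannedStep:
    tool: str            # e.g. "kubernetes.patch_deployment" (illustrative name)
    params: dict
    permission: str      # "read" or "write"
    rationale: str       # the evidence gathered during diagnosis

@dataclass
class Plan:
    goal: str
    steps: List[PlannedStep] = field(default_factory=list)

    def steps_needing_approval(self) -> List[PlannedStep]:
        # Only mutating steps are presented to the human for sign-off.
        return [s for s in self.steps if s.permission == "write"]
```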
4. Verification After Execution
This is the step that neither scripts nor raw AI typically include. After execution, a verification step checks:
- Did the intended change actually apply?
- Is the system in the expected state?
- Are there unintended side effects?
Verification transforms "the command succeeded" into "the operation achieved its goal." Exit code 0 from kubectl apply means the API server accepted the manifest. It doesn't mean the pods are healthy, the service is reachable, or the rollout completed. Verification checks the things that matter.
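Here is a minimal sketch of what such a verification pass could look like for the memory-limit scenario used later in this article. The read-only helpers it receives (get_deployment, list_events) are hypothetical stand-ins for real tools, not an actual client API.

```python
def verify_memory_fix(get_deployment, list_events, namespace, deployment, new_limit):
    """Hypothetical verification pass; get_deployment/list_events are read-only tools."""
    checks = {}
    deploy = get_deployment(namespace=namespace, name=deployment)
    # 1. Did the intended change actually apply?
    checks["limit_applied"] = deploy["memory_limit"] == new_limit
    # 2. Is the system in the expected state?
    checks["replicas_ready"] = deploy["ready_replicas"] == deploy["desired_replicas"]
    # 3. Are there unintended side effects?
    events = list_events(namespace=namespace, name=deployment, minutes=10)
    checks["no_new_oomkills"] = not any(e["reason"] == "OOMKilled" for e in events)
    checks["passed"] = all(checks.values())
    return checks
```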
A Side-by-Side Comparison
Let's take a concrete scenario: a deployment has pods in CrashLoopBackOff and you need to investigate and fix it.
Script approach:
```bash
#!/bin/bash
# diagnose-crashloop.sh

NS=$1
DEPLOY=$2

# Get failing pods
PODS=$(kubectl get pods -n $NS -l app=$DEPLOY --field-selector=status.phase!=Running -o name)

# Get logs from first failing pod
FIRST_POD=$(echo "$PODS" | head -1)
kubectl logs $FIRST_POD -n $NS --tail=50

# Common fix: restart the deployment
read -p "Restart deployment? (y/n) " CONFIRM
if [ "$CONFIRM" = "y" ]; then
  kubectl rollout restart deployment/$DEPLOY -n $NS
fi
```
This script has a fixed diagnostic path (check logs, restart). If the issue is resource limits, a bad ConfigMap, or a missing Secret, the script's diagnostic step doesn't help, and its fix (restart) won't resolve the underlying issue.
Raw AI approach:
You: "payment-service pods are crash-looping in production. Fix it."
AI: Let me check... *generates kubectl commands, executes them*
The pods are OOMKilling. I'll increase the memory limit.
*generates and executes kubectl patch without approval*
Done! Memory limit increased to 1Gi.Fast, but dangerous. The AI picked a memory value (1Gi) without checking what the current usage actually is. It didn't ask for approval before mutating a production deployment. It didn't verify the fix worked.
Agentic AI approach (Skyflo):
- Plan: Agent discovers pods, checks events (OOMKilled), checks resource limits (256Mi), checks actual usage (around 240Mi steady, peaking near 280Mi), checks deployment history (new version increased memory footprint).
- Propose: "Memory limit is 256Mi but usage peaks at 280Mi. Recommend increasing to 512Mi based on observed usage patterns."
- Approve: You review the specific patch, the target deployment, the namespace. You approve.
- Execute: Agent applies the patch via typed MCP tool call.
- Verify: Agent checks new pods are healthy, memory usage is within new limits, no OOMKill events, latency recovered.
Same outcome: memory limits get fixed. But the agentic approach is grounded in evidence, bounded by typed tools, gated by human approval, and verified by automated checks.
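Pulling those five steps together, a simplified control loop might look like the sketch below. Every helper passed in is a hypothetical placeholder for the agent's real planning, approval, execution, and verification machinery, not Skyflo's actual implementation.

```python
def remediate(goal, diagnose, propose_fix, ask_for_approval, execute, verify):
    evidence = diagnose(goal)              # read-only tools, no approval needed
    plan = propose_fix(goal, evidence)     # grounded in observed cluster state
    if not ask_for_approval(plan):         # human reviews the specific mutation
        return {"status": "rejected", "plan": plan}
    result = execute(plan)                 # typed MCP tool calls only
    checks = verify(plan, result)          # did the change achieve its goal?
    status = "verified" if checks["passed"] else "needs_attention"
    return {"status": status, "plan": plan, "result": result, "checks": checks}
```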
MCP: The Bridge Between Intent and Execution
The Model Context Protocol is what makes the "typed tool" concept practical at scale. Without MCP, you'd need to hardcode every tool as a custom function in your agent's codebase. With MCP, tools are defined once in an MCP server and automatically discovered by the agent.
Here's why this matters for the agentic vs scripts debate:
Scripts are static. To add a new capability, you write a new script. To update an existing capability, you edit the script and hope you don't break it for other use cases.
MCP tools are composable. The agent can combine tools in ways the tool author didn't anticipate. A tool for kubernetes.get_pod_logs and a tool for kubernetes.top_pods and a tool for prometheus.query can be combined by the agent to perform a diagnostic workflow that no single script encodes.
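As an illustration of that composability, the sketch below chains three such tools into one diagnostic pass. call_tool, the parameter names, and the PromQL query are assumptions made for the example, not a documented API.

```python
def diagnose_latency(call_tool, service, namespace):
    """`call_tool` stands in for invoking any discovered MCP tool by name."""
    pods = call_tool("kubernetes.top_pods",
                     namespace=namespace, selector=f"app={service}")
    logs = {p["name"]: call_tool("kubernetes.get_pod_logs",
                                 name=p["name"], namespace=namespace, tail=100)
            for p in pods}
    latency = call_tool("prometheus.query",
                        query=f'histogram_quantile(0.99, '
                              f'rate(http_request_duration_seconds_bucket'
                              f'{{service="{service}"}}[5m]))')
    # All of these are read-only calls; any fix the agent proposes afterwards
    # would go through the write-approval gate.
    return {"pods": pods, "logs": logs, "p99_latency": latency}
```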
MCP enforces boundaries. Each tool declares its permission level. The agent can discover all available tools, but write tools always require approval. A script has no such boundary; it does whatever it's coded to do.
For teams evaluating this approach, the supported tools page shows the full MCP tool ecosystem available in Skyflo, including Kubernetes, Helm, Argo, Jenkins, and observability tools.
When to Use What
Agentic AI doesn't replace all scripts. Here's a practical decision framework:
| Use Case | Best Approach | Why |
|---|---|---|
| Scheduled cleanup (delete old pods every night) | Script or CronJob | Predictable, repetitive, low risk |
| CI/CD pipeline steps (build, test, push) | Script/pipeline tool | Deterministic, well-scoped, version-controlled |
| Incident investigation (why is this service slow?) | Agentic AI | Requires exploration, context, multi-step reasoning |
| Novel operational tasks (tasks you haven't scripted yet) | Agentic AI | No existing script, needs adaptability |
| Cross-tool correlation (K8s + Prometheus + Helm) | Agentic AI | Spans multiple tools, needs unified context |
| Compliance/audit operations | Agentic AI with strict gates | Needs structured execution + approval trail |
The pattern: use scripts for predictable, repetitive, well-scoped tasks. Use agentic AI for exploratory, novel, or cross-domain operations where the workflow isn't known in advance.
The Migration Path
You don't rip out your scripts directory on day one. The practical path:
- Start with diagnosis. Use the AI agent for read-only investigation. No mutations, no risk. Get comfortable with how the agent explores your cluster.
- Add approval-gated operations. Start with low-risk mutations: scaling a staging deployment, restarting a non-critical pod. Build trust in the approval flow.
- Retire scripts incrementally. As the agent demonstrates reliability for specific use cases, retire the corresponding scripts. Keep scripts for truly deterministic operations (CronJobs, CI steps).
- Expand the tool ecosystem. Add MCP tools for your specific infrastructure: custom CRDs, internal APIs, observability stacks. The agent becomes more capable without writing more scripts.
The goal isn't "replace bash with AI." It's "stop encoding operational knowledge in brittle scripts and start encoding it in composable, safe, verifiable agent workflows."
For a hands-on walkthrough of what this looks like in practice, see Fixing a Latency Spike in payment-service.
Try Skyflo
Skyflo is open-source, self-hosted, and built on the agentic architecture described in this article. Plan → Execute → Verify with human-in-the-loop at every mutating step.
```bash
helm repo add skyflo https://charts.skyflo.ai
helm install skyflo skyflo/skyflo
```