
Agentic AI vs Script Automation in DevOps

Bash scripts are brittle. Raw AI is dangerous. Agentic AI — with typed tools, planning, and verification — gives you the best of both worlds. Here's why.

9 min read
ai-agents · devops · automation · agentic-ai · strategy

The Automation Spectrum

DevOps automation isn't new. What's new is the third option on the spectrum:

| Approach | How It Works | Strength | Weakness |
| --- | --- | --- | --- |
| Script Automation | Hardcoded bash/Python scripts triggered by events or cron | Predictable, testable, version-controlled | Brittle, no error recovery, can't handle novel situations |
| Raw AI | LLM generates and executes commands from natural language | Flexible, handles novel requests, natural language interface | Unpredictable, hallucinates, no safety boundaries |
| Agentic AI | Structured agents with typed tools, approval gates, and verification | Adaptive + safe, handles novel situations within bounded execution | Requires architecture investment, depends on tool ecosystem |

Most teams are stuck on the first approach. Some have experimented with the second. The third is where the industry is heading, and the design decisions that separate agentic AI from "AI that runs scripts" are what determine whether it's production-grade or demo-grade.


Why Scripts Break: The Brittleness Problem

Every DevOps team has a scripts/ directory. It contains things like:

bash
#!/bin/bash
# restart-stuck-pods.sh
# "Works on my cluster" — written at 3 AM during an incident, never refactored

NAMESPACE=${1:-default}
PODS=$(kubectl get pods -n $NAMESPACE --field-selector=status.phase=Failed -o name)

if [ -z "$PODS" ]; then
  echo "No failed pods found"
  exit 0
fi

for POD in $PODS; do
  echo "Deleting $POD"
  kubectl delete $POD -n $NAMESPACE --grace-period=30
done

echo "Done. Deleted $(echo $PODS | wc -w) pods."

This script works until it doesn't. Here's the failure taxonomy:

Hardcoded assumptions. The script assumes failed pods should be deleted. But what if the pods are failed because of a node issue? Deleting them won't help. They'll just reschedule to the same broken node. The script can't distinguish between "pod has a bug" and "node is unhealthy."

No context awareness. The script doesn't check what else is running in the namespace. If payment-service has 2 of 3 replicas failed and you delete them, you've just reduced capacity to 33% during an incident. A human would check replica counts first. The script doesn't.

Brittle error handling. If kubectl delete fails (API server timeout, RBAC issue, resource already deleted), the script either continues blindly or exits entirely. There's no retry logic, no partial failure handling, no way to resume.

No verification. The script reports "Done. Deleted 5 pods." But it doesn't check whether new pods came up healthy. It doesn't verify that the original problem is resolved. "Deleted successfully" and "system is healthy" are not the same thing.

Maintenance burden. Over time, the scripts directory accumulates dozens of one-off scripts, each with slightly different conventions, different error handling (or none), and different assumptions about cluster state. New team members can't trust them. Experienced team members can't remember which ones are current.

The fundamental problem: scripts encode a specific solution to a specific problem. When the problem changes (different namespace layout, different resource types, different failure mode), the script breaks or does the wrong thing.


Why Raw AI Is Dangerous: The Hallucination Problem

The appeal of replacing scripts with AI is obvious: instead of maintaining 50 brittle scripts, you describe what you need in natural language. No hardcoded paths. No brittle assumptions. The AI adapts.

The problem is that an LLM's adaptability is also its danger. It will confidently generate commands that look correct but are subtly wrong:

Namespace confusion.

code
You: "Delete the old test pods"
AI:  kubectl delete pods --all -n production

The AI interpreted "test pods" as "pods to be tested" in the production namespace, not pods in the test namespace. This is a realistic hallucination. LLMs optimize for helpfulness, not caution.

Flag hallucination.

code
You: "Force restart the stuck deployment"
AI:  kubectl rollout restart deployment/api-gateway --force --cascade=true

--force and --cascade=true are not valid flags for kubectl rollout restart. But the LLM has seen similar patterns in training data and confidently generates them. If you're piping this to a shell, the command fails. Or worse, a future kubectl version adds those flags with different semantics.

Scope creep.

code
You: "Clean up resources from last week's load test"
AI:  kubectl delete namespace load-test-20250127
     kubectl delete pvc -l app=load-test --all-namespaces
     kubectl delete clusterrole load-test-runner
     kubectl delete clusterrolebinding load-test-runner-binding

You asked for cleanup. The AI helpfully expanded scope to include PVCs across all namespaces and cluster-level RBAC resources. Each individual command might be correct, but the aggregate action is far more aggressive than what you intended.

The core issue: An LLM generating shell commands has unbounded output. It can generate any command, including ones that are destructive, irrelevant, or hallucinated. There's no structural constraint on what it can produce, only statistical probability that the output will be correct.


The Agentic Middle Ground

Agentic AI combines the adaptability of AI with the safety boundaries of structured automation. The key architectural components:

1. Typed Tool Execution Instead of Shell Commands

Instead of generating arbitrary kubectl commands, an agentic system calls typed tools with validated parameters:

| Approach | What the AI Produces | What Gets Executed |
| --- | --- | --- |
| Raw AI | kubectl delete pods --all -n production | Exactly that string, in a shell |
| Agentic AI | kubernetes.delete_pod(name="test-runner-7d4f8", namespace="load-test") | A validated function call with schema checking |

The typed tool call enforces several constraints:

  • Parameter validation: namespace must match an existing namespace. name must match an existing resource. Invalid parameters are rejected before execution.
  • Scope limitation: The delete_pod tool deletes one pod. There is no delete_all_pods_everywhere tool. The blast radius is bounded by the tool's definition, not the LLM's creativity.
  • Auditability: The structured tool call is logged with all parameters, the approver, and the result. Raw shell commands in a subprocess are typically not logged at this level of detail.

This is what MCP (Model Context Protocol) provides in Skyflo's architecture. Each tool is defined with a JSON Schema for inputs, a permission tag (read or write), and a structured output format. The LLM selects tools and fills parameters; it doesn't generate arbitrary commands.
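
To make that concrete, here is a minimal sketch of what a typed tool definition and its pre-execution validation can look like. The Tool class, field names, allow-listed namespaces, and the naive validator are illustrative assumptions, not Skyflo's actual MCP server code; a real server describes parameters with full JSON Schema rather than the simplified check shown here.

python
# Illustrative sketch only -- the Tool class, schema fields, and validation
# logic are hypothetical, not Skyflo's actual MCP server implementation.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    permission: str            # "read" or "write"
    schema: dict[str, Any]     # stand-in for a JSON Schema describing parameters
    handler: Callable[..., dict]

def validate(schema: dict, params: dict) -> None:
    """Naive required/enum check standing in for full JSON Schema validation."""
    for field in schema.get("required", []):
        if field not in params:
            raise ValueError(f"missing required parameter: {field}")
    for field, rules in schema.get("properties", {}).items():
        allowed = rules.get("enum")
        if allowed and params.get(field) not in allowed:
            raise ValueError(f"{field}={params.get(field)!r} not in {allowed}")

# A deliberately narrow tool: it deletes exactly one named pod in an
# allow-listed namespace -- nothing broader exists for the LLM to call.
delete_pod = Tool(
    name="kubernetes.delete_pod",
    permission="write",
    schema={
        "required": ["name", "namespace"],
        "properties": {"namespace": {"enum": ["load-test", "staging"]}},
    },
    handler=lambda name, namespace: {"deleted": name, "namespace": namespace},
)

params = {"name": "test-runner-7d4f8", "namespace": "load-test"}
validate(delete_pod.schema, params)    # bad parameters are rejected before execution
result = delete_pod.handler(**params)  # runs only after validation (and approval)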

2. Read/Write Boundary

Agentic systems distinguish between operations that observe and operations that change:

| Operation Type | Examples | Gate |
| --- | --- | --- |
| Read | List pods, get logs, describe deployment, query Prometheus | Execute freely, no approval needed |
| Write | Scale deployment, apply manifest, delete pod, rollback Helm release | Requires human approval before execution |

This boundary is critical because most diagnostic work is read-only. When an agent investigates a latency spike, it needs to list pods, check events, read logs, and query metrics, all without asking for approval at every step. The approval gate fires only when the agent proposes a change.

Scripts don't have this concept. A script that "fixes" a problem typically mixes reads and writes in a single flow. An agentic system separates diagnosis (reads) from remediation (writes), giving the human control over the consequential step.
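
A sketch of how such a gate can sit in the agent's dispatch loop, assuming a simple interactive prompt; the Tool class, request_approval, and the example tools are hypothetical stand-ins for however your platform actually collects approvals.

python
# Sketch of a read/write gate. request_approval() is a hypothetical stand-in
# for however your team surfaces approvals (chat prompt, web UI, ticket).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    permission: str              # "read" or "write"
    handler: Callable[..., dict]

def request_approval(tool: Tool, params: dict) -> bool:
    answer = input(f"Approve {tool.name}({params})? [y/N] ")
    return answer.strip().lower() == "y"

def dispatch(tool: Tool, params: dict) -> dict:
    # Reads flow straight through; the gate fires only on mutations.
    if tool.permission == "write" and not request_approval(tool, params):
        return {"status": "rejected", "tool": tool.name}
    return tool.handler(**params)

list_pods = Tool("kubernetes.list_pods", "read",
                 lambda namespace: {"pods": ["api-1", "api-2"]})
scale = Tool("kubernetes.scale_deployment", "write",
             lambda name, namespace, replicas: {"scaled": name, "replicas": replicas})

dispatch(list_pods, {"namespace": "staging"})                            # no prompt
dispatch(scale, {"name": "api", "namespace": "staging", "replicas": 5})  # prompts first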

3. Planning Before Execution

Scripts execute immediately. Raw AI generates and executes in one step. Agentic AI plans first.

The planning phase serves two purposes:

It grounds the AI in reality. Instead of generating a fix from training data, the agent queries your actual cluster state. It knows how many replicas you have, what resource limits are set, which version is deployed. The proposed action is based on evidence, not pattern matching.

It creates a review point. The plan is presented to the human before any mutation occurs. You review the proposed action, the target resources, and the expected outcome. If the plan is wrong (wrong namespace, wrong resource, wrong approach), you catch it before it touches your cluster.
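
As a sketch, the plan can be nothing more exotic than a structured object the agent fills from read-only queries and presents to a human; the class and field names below are illustrative, not Skyflo's internal format, and the numbers mirror the OOMKill scenario used later in this article.

python
# Sketch of a plan object built from live cluster state and shown for review
# before anything mutates. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class PlannedStep:
    tool: str                 # e.g. "kubernetes.patch_deployment"
    params: dict
    reason: str               # the evidence that motivated this step

@dataclass
class Plan:
    goal: str
    evidence: list[str]       # facts gathered by read-only tools
    steps: list[PlannedStep] = field(default_factory=list)

plan = Plan(
    goal="Stop OOMKills on payment-service",
    evidence=[
        "limits.memory = 256Mi",
        "container memory peaks near 280Mi",
        "OOMKilled events recorded in the last hour",
    ],
    steps=[PlannedStep(
        tool="kubernetes.patch_deployment",
        params={"name": "payment-service", "namespace": "production",
                "memory_limit": "512Mi"},
        reason="Observed peak exceeds the current limit; 512Mi adds headroom.",
    )],
)

# The human reviews plan.goal, plan.evidence, and every step's target and
# parameters before any write tool is allowed to run.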

4. Verification After Execution

This is the step that neither scripts nor raw AI typically include. After execution, a verification step checks:

  • Did the intended change actually apply?
  • Is the system in the expected state?
  • Are there unintended side effects?

Verification transforms "the command succeeded" into "the operation achieved its goal." Exit code 0 from kubectl apply means the API server accepted the manifest. It doesn't mean the pods are healthy, the service is reachable, or the rollout completed. Verification checks the things that matter.
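
A sketch of what that verification pass can look like for the memory-limit scenario discussed in the next section; get_pods, get_events, and query_metric are hypothetical read-only tool wrappers, and the Prometheus expression is a common pattern rather than a required one.

python
# Sketch of a post-execution verification pass built from read-only checks.
# The three callables are hypothetical stand-ins for read tools; query_metric
# is assumed to return a numeric value in bytes.
def verify_memory_fix(get_pods, get_events, query_metric) -> list[str]:
    failures = []

    pods = get_pods(namespace="production", selector="app=payment-service")
    if not all(p["status"] == "Running" and p["ready"] for p in pods):
        failures.append("not all replacement pods are Running and Ready")

    events = get_events(namespace="production", since_minutes=10)
    if any(e["reason"] == "OOMKilled" for e in events):
        failures.append("OOMKilled events still occurring after the patch")

    peak = query_metric(
        'max(container_memory_working_set_bytes{pod=~"payment-service.*"})')
    if peak > 512 * 1024 * 1024:
        failures.append("memory usage already exceeds the new 512Mi limit")

    return failures  # empty list == "the operation achieved its goal"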


A Side-by-Side Comparison

Let's take a concrete scenario: a deployment has pods in CrashLoopBackOff and you need to investigate and fix it.

Script approach:

bash
#!/bin/bash
# diagnose-crashloop.sh
NS=$1
DEPLOY=$2

# Get failing pods
PODS=$(kubectl get pods -n $NS -l app=$DEPLOY --field-selector=status.phase!=Running -o name)

# Get logs from first failing pod
FIRST_POD=$(echo "$PODS" | head -n 1)
kubectl logs $FIRST_POD -n $NS --tail=50

# Common fix: restart the deployment
read -p "Restart deployment? (y/n) " CONFIRM
if [ "$CONFIRM" = "y" ]; then
  kubectl rollout restart deployment/$DEPLOY -n $NS
fi

This script has a fixed diagnostic path (check logs, restart). If the issue is resource limits, a bad config map, or a missing secret, the script's diagnostic step doesn't help, and its fix (restart) won't resolve the underlying issue.

Raw AI approach:

code
You: "payment-service pods are crash-looping in production. Fix it."
AI:  Let me check... *generates kubectl commands, executes them*
     The pods are OOMKilling. I'll increase the memory limit.
     *generates and executes kubectl patch without approval*
     Done! Memory limit increased to 1Gi.

Fast, but dangerous. The AI picked a memory value (1Gi) without checking what the current usage actually is. It didn't ask for approval before mutating a production deployment. It didn't verify the fix worked.

Agentic AI approach (Skyflo):

  1. Plan: Agent discovers pods, checks events (OOMKilled), checks resource limits (256Mi), checks actual usage (around 240Mi, with peaks near 280Mi), checks deployment history (new version increased memory footprint).
  2. Propose: "Memory limit is 256Mi but usage peaks at 280Mi. Recommend increasing to 512Mi based on observed usage patterns."
  3. Approve: You review the specific patch, the target deployment, the namespace. You approve.
  4. Execute: Agent applies the patch via typed MCP tool call.
  5. Verify: Agent checks new pods are healthy, memory usage is within new limits, no OOMKill events, latency recovered.

Same outcome: memory limits get fixed. But the agentic approach is grounded in evidence, bounded by typed tools, gated by human approval, and verified by automated checks.
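
For concreteness, this is roughly what the approval step could surface for review: a single structured write call rather than a free-form shell command. The tool name and parameter layout are illustrative assumptions; the nested dict is a standard Kubernetes strategic-merge patch body for raising the container's memory limit.

python
# Illustrative approval payload for the scenario above. Tool name and params
# shape are hypothetical; the patch body is ordinary Kubernetes YAML-as-dict.
proposed_call = {
    "tool": "kubernetes.patch_deployment",
    "permission": "write",
    "params": {
        "name": "payment-service",
        "namespace": "production",
        "patch": {
            "spec": {"template": {"spec": {"containers": [{
                "name": "payment-service",
                "resources": {"limits": {"memory": "512Mi"}},
            }]}}}
        },
    },
}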


MCP: The Bridge Between Intent and Execution

The Model Context Protocol is what makes the "typed tool" concept practical at scale. Without MCP, you'd need to hardcode every tool as a custom function in your agent's codebase. With MCP, tools are defined once in an MCP server and automatically discovered by the agent.

Here's why this matters for the agentic vs scripts debate:

Scripts are static. To add a new capability, you write a new script. To update an existing capability, you edit the script and hope you don't break it for other use cases.

MCP tools are composable. The agent can combine tools in ways the tool author didn't anticipate. A tool for kubernetes.get_pod_logs and a tool for kubernetes.top_pods and a tool for prometheus.query can be combined by the agent to perform a diagnostic workflow that no single script encodes.
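
As a sketch of that composition, assuming a generic call_tool helper that routes named MCP tool calls and returns structured results (the metric name and the 80% CPU threshold are illustrative):

python
# Sketch of tool composition: three independent read tools combined into a
# diagnostic pass that no single script encodes. call_tool() is a hypothetical
# router for named MCP tool calls.
def diagnose_latency(call_tool, namespace: str, service: str) -> dict:
    # 1. Which pods are hot?
    top = call_tool("kubernetes.top_pods", {"namespace": namespace})
    hot = [p for p in top["pods"] if p["cpu_percent"] > 80]

    # 2. What are the hot pods logging?
    logs = {p["name"]: call_tool("kubernetes.get_pod_logs",
                                 {"namespace": namespace, "name": p["name"],
                                  "tail_lines": 100})
            for p in hot}

    # 3. Does the service-level latency metric agree?
    latency = call_tool("prometheus.query", {
        "query": f'histogram_quantile(0.99, sum(rate('
                 f'http_request_duration_seconds_bucket{{service="{service}"}}[5m])) by (le))'
    })

    return {"hot_pods": hot, "logs": logs, "p99_latency": latency}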

MCP enforces boundaries. Each tool declares its permission level. The agent can discover all available tools, but write tools always require approval. A script has no such boundary; it does whatever it's coded to do.

For teams evaluating this approach, the supported tools page shows the full MCP tool ecosystem available in Skyflo, including Kubernetes, Helm, Argo, Jenkins, and observability tools.


When to Use What

Agentic AI doesn't replace all scripts. Here's a practical decision framework:

| Use Case | Best Approach | Why |
| --- | --- | --- |
| Scheduled cleanup (delete old pods every night) | Script or CronJob | Predictable, repetitive, low risk |
| CI/CD pipeline steps (build, test, push) | Script/pipeline tool | Deterministic, well-scoped, version-controlled |
| Incident investigation (why is this service slow?) | Agentic AI | Requires exploration, context, multi-step reasoning |
| Novel operational tasks (tasks you haven't scripted yet) | Agentic AI | No existing script, needs adaptability |
| Cross-tool correlation (K8s + Prometheus + Helm) | Agentic AI | Spans multiple tools, needs unified context |
| Compliance/audit operations | Agentic AI with strict gates | Needs structured execution + approval trail |

The pattern: use scripts for predictable, repetitive, well-scoped tasks. Use agentic AI for exploratory, novel, or cross-domain operations where the workflow isn't known in advance.


The Migration Path

You don't rip out your scripts directory on day one. The practical path:

  1. Start with diagnosis. Use the AI agent for read-only investigation. No mutations, no risk. Get comfortable with how the agent explores your cluster.
  2. Add approval-gated operations. Start with low-risk mutations: scaling a staging deployment, restarting a non-critical pod. Build trust in the approval flow.
  3. Retire scripts incrementally. As the agent demonstrates reliability for specific use cases, retire the corresponding scripts. Keep scripts for truly deterministic operations (CronJobs, CI steps).
  4. Expand the tool ecosystem. Add MCP tools for your specific infrastructure: custom CRDs, internal APIs, observability stacks. The agent becomes more capable without writing more scripts.
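
A sketch of what that last step can look like in practice: wrapping an internal, read-only API as a tool so the agent gains a capability without another script. The endpoint, tool name, and registration shape are hypothetical; adapt them to whichever MCP server you run.

python
# Sketch of wrapping an internal API as a read-only tool. The endpoint, tool
# name, and registry dict are hypothetical stand-ins.
import json
import urllib.request

def get_service_owner(service: str) -> dict:
    """Read-only lookup against a (hypothetical) internal service catalog."""
    url = f"https://catalog.internal.example.com/api/v1/services/{service}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

CUSTOM_TOOLS = {
    "catalog.get_service_owner": {
        "permission": "read",                # never needs approval
        "schema": {"required": ["service"]},
        "handler": get_service_owner,
    },
}
# Once registered, the agent can correlate "payment-service is OOMKilling"
# with "who owns payment-service" in a single diagnostic pass.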

The goal isn't "replace bash with AI." It's "stop encoding operational knowledge in brittle scripts and start encoding it in composable, safe, verifiable agent workflows."

For a hands-on walkthrough of what this looks like in practice, see Fixing a Latency Spike in payment-service.


Try Skyflo

Skyflo is open-source, self-hosted, and built on the agentic architecture described in this article. Plan → Execute → Verify with human-in-the-loop at every mutating step.

bash
helm repo add skyflo https://charts.skyflo.ai
helm install skyflo skyflo/skyflo