The Automation Spectrum
DevOps automation isn't new. What's new is the third option on the spectrum:
| Approach | How It Works | Strength | Weakness |
|---|---|---|---|
| Script Automation | Hardcoded bash/Python scripts triggered by events or cron | Predictable, testable, version-controlled | Brittle, no error recovery, can't handle novel situations |
| Raw AI | LLM generates and executes commands from natural language | Flexible, handles novel requests, natural language interface | Unpredictable, hallucinates, no safety boundaries |
| Agentic AI | Structured agents with typed tools, approval gates, and verification | Adaptive + safe, handles novel situations within bounded execution | Requires architecture investment, depends on tool ecosystem |
Most teams are stuck in the first row. Some have experimented with the second. The third is where the industry is heading, and the design decisions that separate agentic AI from "AI that runs scripts" are what determine whether it's production-grade or demo-grade.
Why Scripts Break: The Brittleness Problem
Every DevOps team has a scripts/ directory. It contains things like:
```bash
#!/bin/bash
# restart-stuck-pods.sh
# "Works on my cluster" — written at 3 AM during an incident, never refactored

NAMESPACE=${1:-default}
PODS=$(kubectl get pods -n $NAMESPACE --field-selector=status.phase=Failed -o name)

if [ -z "$PODS" ]; then
  echo "No failed pods found"
  exit 0
fi

for POD in $PODS; do
  echo "Deleting $POD"
  kubectl delete $POD -n $NAMESPACE --grace-period=30
done

echo "Done. Deleted $(echo $PODS | wc -w) pods."
```
This script works until it doesn't. Here's the failure taxonomy:
Hardcoded assumptions. The script assumes failed pods should be deleted. But what if the pods are failed because of a node issue? Deleting them won't help. They'll just reschedule to the same broken node. The script can't distinguish between "pod has a bug" and "node is unhealthy."
No context awareness. The script doesn't check what else is running in the namespace. If payment-service has 2 of 3 replicas failed and you delete them, you've just reduced capacity to 33% during an incident. A human would check replica counts first. The script doesn't.
Brittle error handling. If kubectl delete fails (API server timeout, RBAC issue, resource already deleted), the script either continues blindly or exits entirely. There's no retry logic, no partial failure handling, no way to resume.
No verification. The script reports "Done. Deleted 5 pods." But it doesn't check whether new pods came up healthy. It doesn't verify that the original problem is resolved. "Deleted successfully" and "system is healthy" are not the same thing.
Maintenance burden. Over time, the scripts directory accumulates dozens of one-off scripts, each with slightly different conventions, different error handling (or none), and different assumptions about cluster state. New team members can't trust them. Experienced team members can't remember which ones are current.
The fundamental problem: scripts encode a specific solution to a specific problem. When the problem changes (different namespace layout, different resource types, different failure mode), the script breaks or does the wrong thing.
Why Raw AI Is Dangerous: The Hallucination Problem
The appeal of replacing scripts with AI is obvious: instead of maintaining 50 brittle scripts, you tell the AI what you need in natural language. No hardcoded paths. No brittle assumptions. The AI adapts.
The problem is that an LLM's adaptability is also its danger. It will confidently generate commands that look correct but are subtly wrong:
Namespace confusion.
You: "Delete the old test pods"
AI: kubectl delete pods --all -n productionThe AI interpreted "test pods" as "pods to be tested" in the production namespace, not pods in the test namespace. This is a realistic hallucination. LLMs optimize for helpfulness, not caution.
Flag hallucination.
You: "Force restart the stuck deployment"
AI: kubectl rollout restart deployment/api-gateway --force --cascade=true--force and --cascade=true are not valid flags for kubectl rollout restart. But the LLM has seen similar patterns in training data and confidently generates them. If you're piping this to a shell, the command fails. Or worse, a future kubectl version adds those flags with different semantics.
Scope creep.
You: "Clean up resources from last week's load test"
AI: kubectl delete namespace load-test-20250127
kubectl delete pvc -l app=load-test --all-namespaces
kubectl delete clusterrole load-test-runner
kubectl delete clusterrolebinding load-test-runner-bindingYou asked for cleanup. The AI helpfully expanded scope to include PVCs across all namespaces and cluster-level RBAC resources. Each individual command might be correct, but the aggregate action is far more aggressive than what you intended.
The core issue: An LLM generating shell commands has unbounded output. It can generate any command, including ones that are destructive, irrelevant, or hallucinated. There's no structural constraint on what it can produce, only statistical probability that the output will be correct.
The Agentic Middle Ground
Agentic AI combines the adaptability of AI with the safety boundaries of structured automation. The key architectural components:
1. Typed Tool Execution Instead of Shell Commands
Instead of generating arbitrary kubectl commands, an agentic system calls typed tools with validated parameters:
| Approach | What the AI produces | What gets executed |
|---|---|---|
| Raw AI | kubectl delete pods --all -n production | Exactly that string, in a shell |
| Agentic AI | kubernetes.delete_pod(name="test-runner-7d4f8", namespace="load-test") | A validated function call with schema checking |
The typed tool call enforces several constraints:
- Parameter validation: `namespace` must match an existing namespace. `name` must match an existing resource. Invalid parameters are rejected before execution.
- Scope limitation: The `delete_pod` tool deletes one pod. There is no `delete_all_pods_everywhere` tool. The blast radius is bounded by the tool's definition, not the LLM's creativity.
- Auditability: The structured tool call is logged with all parameters, the approver, and the result. Raw shell commands in a subprocess are typically not logged at this level of detail.
This is what MCP (Model Context Protocol) provides in Skyflo's architecture. Each tool is defined with a JSON Schema for inputs, a permission tag (read or write), and a structured output format. The LLM selects tools and fills parameters; it doesn't generate arbitrary commands.
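To make this concrete, here is a minimal sketch of what a typed tool definition in this spirit might look like. The field names, the schema layout, and the tool name are illustrative assumptions for this article, not Skyflo's actual MCP server code.

```python
# Hypothetical sketch of a typed tool definition in the spirit of MCP.
# All names and fields here are assumptions, not Skyflo's real schema.
DELETE_POD_TOOL = {
    "name": "kubernetes.delete_pod",
    "permission": "write",          # write tools require human approval
    "input_schema": {               # JSON Schema validated before execution
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Exact pod name"},
            "namespace": {"type": "string", "description": "Existing namespace"},
            "grace_period_seconds": {"type": "integer", "minimum": 0, "default": 30},
        },
        "required": ["name", "namespace"],
        "additionalProperties": False,  # no extra flags the LLM can invent
    },
}
```

The LLM can only fill the fields the schema exposes; there is no parameter it can set that would turn this into a delete-everything-everywhere operation.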
2. Read/Write Boundary
Agentic systems distinguish between operations that observe and operations that change:
| Operation Type | Examples | Gate |
|---|---|---|
| Read | List pods, get logs, describe deployment, query Prometheus | Execute freely, no approval needed |
| Write | Scale deployment, apply manifest, delete pod, rollback Helm release | Requires human approval before execution |
This boundary is critical because most diagnostic work is read-only. When an agent investigates a latency spike, it needs to list pods, check events, read logs, and query metrics, all without asking for approval at every step. The approval gate fires only when the agent proposes a change.
Scripts don't have this concept. A script that "fixes" a problem typically mixes reads and writes in a single flow. An agentic system separates diagnosis (reads) from remediation (writes), giving the human control over the consequential step.
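A minimal sketch of how that boundary can be enforced in code, assuming each tool dict carries the permission tag shown earlier; run_tool and the approval callback are placeholders, not Skyflo's actual interfaces.

```python
def run_tool(tool, params):
    """Placeholder for the real MCP tool invocation."""
    return {"status": "executed", "tool": tool["name"], "params": params}

def execute_tool(tool, params, ask_for_approval):
    if tool["permission"] == "read":
        # Read-only diagnostics run freely, no approval gate.
        return run_tool(tool, params)
    # Every write stops here until a human approves this specific call.
    if not ask_for_approval(tool["name"], params):
        return {"status": "rejected", "tool": tool["name"]}
    return run_tool(tool, params)
```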
3. Planning Before Execution
Scripts execute immediately. Raw AI generates and executes in one step. Agentic AI plans first.
The planning phase serves two purposes:
It grounds the AI in reality. Instead of generating a fix from training data, the agent queries your actual cluster state. It knows how many replicas you have, what resource limits are set, which version is deployed. The proposed action is based on evidence, not pattern matching.
It creates a review point. The plan is presented to the human before any mutation occurs. You review the proposed action, the target resources, and the expected outcome. If the plan is wrong (wrong namespace, wrong resource, wrong approach), you catch it before it touches your cluster.
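As a rough illustration of what a reviewable plan could look like as data, here is a minimal sketch; the class and field names are assumptions for this example, not Skyflo's actual planner output.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlannedStep:
    tool: str            # e.g. "kubernetes.patch_deployment" (illustrative name)
    params: dict
    permission: str      # "read" or "write"
    rationale: str       # the evidence gathered during diagnosis

@dataclass
class Plan:
    goal: str
    steps: List[PlannedStep] = field(default_factory=list)

    def steps_needing_approval(self) -> List[PlannedStep]:
        # Only mutating steps are presented to the human for sign-off.
        return [s for s in self.steps if s.permission == "write"]
```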
4. Verification After Execution
This is the step that neither scripts nor raw AI typically include. After execution, a verification step checks:
- Did the intended change actually apply?
- Is the system in the expected state?
- Are there unintended side effects?
Verification transforms "the command succeeded" into "the operation achieved its goal." Exit code 0 from kubectl apply means the API server accepted the manifest. It doesn't mean the pods are healthy, the service is reachable, or the rollout completed. Verification checks the things that matter.
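Here is a minimal sketch of what such a verification pass could look like for the memory-limit scenario used later in this article. The read-only helpers it receives (get_deployment, list_events) are hypothetical stand-ins for real tools, not an actual client API.

```python
def verify_memory_fix(get_deployment, list_events, namespace, deployment, new_limit):
    """Hypothetical verification pass; get_deployment/list_events are read-only tools."""
    checks = {}
    deploy = get_deployment(namespace=namespace, name=deployment)
    # 1. Did the intended change actually apply?
    checks["limit_applied"] = deploy["memory_limit"] == new_limit
    # 2. Is the system in the expected state?
    checks["replicas_ready"] = deploy["ready_replicas"] == deploy["desired_replicas"]
    # 3. Are there unintended side effects?
    events = list_events(namespace=namespace, name=deployment, minutes=10)
    checks["no_new_oomkills"] = not any(e["reason"] == "OOMKilled" for e in events)
    checks["passed"] = all(checks.values())
    return checks
```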
A Side-by-Side Comparison
Let's take a concrete scenario: a deployment has pods in CrashLoopBackOff and you need to investigate and fix it.
Script approach:
```bash
#!/bin/bash
# diagnose-crashloop.sh

NS=$1
DEPLOY=$2

# Get failing pods
PODS=$(kubectl get pods -n $NS -l app=$DEPLOY --field-selector=status.phase!=Running -o name)

# Get logs from first failing pod
FIRST_POD=$(echo "$PODS" | head -1)
kubectl logs $FIRST_POD -n $NS --tail=50

# Common fix: restart the deployment
read -p "Restart deployment? (y/n) " CONFIRM
if [ "$CONFIRM" = "y" ]; then
  kubectl rollout restart deployment/$DEPLOY -n $NS
fi
```
This script has a fixed diagnostic path (check logs, restart). If the issue is resource limits, a bad ConfigMap, or a missing Secret, the script's diagnostic step doesn't help, and its fix (restart) won't resolve the underlying issue.
Raw AI approach:
You: "payment-service pods are crash-looping in production. Fix it."
AI: Let me check... *generates kubectl commands, executes them*
The pods are OOMKilling. I'll increase the memory limit.
*generates and executes kubectl patch without approval*
Done! Memory limit increased to 1Gi.Fast, but dangerous. The AI picked a memory value (1Gi) without checking what the current usage actually is. It didn't ask for approval before mutating a production deployment. It didn't verify the fix worked.
Agentic AI approach (Skyflo):
- Plan: Agent discovers pods, checks events (OOMKilled), checks resource limits (256Mi), checks actual usage (around 240Mi steady, peaking near 280Mi), checks deployment history (new version increased memory footprint).
- Propose: "Memory limit is 256Mi but usage peaks at 280Mi. Recommend increasing to 512Mi based on observed usage patterns."
- Approve: You review the specific patch, the target deployment, the namespace. You approve.
- Execute: Agent applies the patch via typed MCP tool call.
- Verify: Agent checks new pods are healthy, memory usage is within new limits, no OOMKill events, latency recovered.
Same outcome: memory limits get fixed. But the agentic approach is grounded in evidence, bounded by typed tools, gated by human approval, and verified by automated checks.
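Pulling those five steps together, a simplified control loop might look like the sketch below. Every helper passed in is a hypothetical placeholder for the agent's real planning, approval, execution, and verification machinery, not Skyflo's actual implementation.

```python
def remediate(goal, diagnose, propose_fix, ask_for_approval, execute, verify):
    evidence = diagnose(goal)              # read-only tools, no approval needed
    plan = propose_fix(goal, evidence)     # grounded in observed cluster state
    if not ask_for_approval(plan):         # human reviews the specific mutation
        return {"status": "rejected", "plan": plan}
    result = execute(plan)                 # typed MCP tool calls only
    checks = verify(plan, result)          # did the change achieve its goal?
    status = "verified" if checks["passed"] else "needs_attention"
    return {"status": status, "plan": plan, "result": result, "checks": checks}
```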
MCP: The Bridge Between Intent and Execution
The Model Context Protocol is what makes the "typed tool" concept practical at scale. Without MCP, you'd need to hardcode every tool as a custom function in your agent's codebase. With MCP, tools are defined once in an MCP server and automatically discovered by the agent.
Here's why this matters for the agentic vs scripts debate:
Scripts are static. To add a new capability, you write a new script. To update an existing capability, you edit the script and hope you don't break it for other use cases.
MCP tools are composable. The agent can combine tools in ways the tool author didn't anticipate. A tool for kubernetes.get_pod_logs and a tool for kubernetes.top_pods and a tool for prometheus.query can be combined by the agent to perform a diagnostic workflow that no single script encodes.
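As an illustration of that composability, the sketch below chains three such tools into one diagnostic pass. call_tool, the parameter names, and the PromQL query are assumptions made for the example, not a documented API.

```python
def diagnose_latency(call_tool, service, namespace):
    """`call_tool` stands in for invoking any discovered MCP tool by name."""
    pods = call_tool("kubernetes.top_pods",
                     namespace=namespace, selector=f"app={service}")
    logs = {p["name"]: call_tool("kubernetes.get_pod_logs",
                                 name=p["name"], namespace=namespace, tail=100)
            for p in pods}
    latency = call_tool("prometheus.query",
                        query=f'histogram_quantile(0.99, '
                              f'rate(http_request_duration_seconds_bucket'
                              f'{{service="{service}"}}[5m]))')
    # All of these are read-only calls; any fix the agent proposes afterwards
    # would go through the write-approval gate.
    return {"pods": pods, "logs": logs, "p99_latency": latency}
```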
MCP enforces boundaries. Each tool declares its permission level. The agent can discover all available tools, but write tools always require approval. A script has no such boundary; it does whatever it's coded to do.
For teams evaluating this approach, the supported tools page shows the full MCP tool ecosystem available in Skyflo, including Kubernetes, Helm, Argo, Jenkins, and observability tools.
When to Use What
Agentic AI doesn't replace all scripts. Here's a practical decision framework:
| Use Case | Best Approach | Why |
|---|---|---|
| Scheduled cleanup (delete old pods every night) | Script or CronJob | Predictable, repetitive, low risk |
| CI/CD pipeline steps (build, test, push) | Script/pipeline tool | Deterministic, well-scoped, version-controlled |
| Incident investigation (why is this service slow?) | Agentic AI | Requires exploration, context, multi-step reasoning |
| Novel operational tasks (tasks you haven't scripted yet) | Agentic AI | No existing script, needs adaptability |
| Cross-tool correlation (K8s + Prometheus + Helm) | Agentic AI | Spans multiple tools, needs unified context |
| Compliance/audit operations | Agentic AI with strict gates | Needs structured execution + approval trail |
The pattern: use scripts for predictable, repetitive, well-scoped tasks. Use agentic AI for exploratory, novel, or cross-domain operations where the workflow isn't known in advance.
The Migration Path
You don't rip out your scripts directory on day one. The practical path:
- Start with diagnosis. Use the AI agent for read-only investigation. No mutations, no risk. Get comfortable with how the agent explores your cluster.
- Add approval-gated operations. Start with low-risk mutations: scaling a staging deployment, restarting a non-critical pod. Build trust in the approval flow.
- Retire scripts incrementally. As the agent demonstrates reliability for specific use cases, retire the corresponding scripts. Keep scripts for truly deterministic operations (CronJobs, CI steps).
- Expand the tool ecosystem. Add MCP tools for your specific infrastructure: custom CRDs, internal APIs, observability stacks. The agent becomes more capable without writing more scripts.
The goal isn't "replace bash with AI." It's "stop encoding operational knowledge in brittle scripts and start encoding it in composable, safe, verifiable agent workflows."
For a hands-on walkthrough of what this looks like in practice, see Fixing a Latency Spike in payment-service.
Try Skyflo
Skyflo is open-source, self-hosted, and built on the agentic architecture described in this article. Plan → Execute → Verify with human-in-the-loop at every mutating step.
```bash
helm repo add skyflo https://charts.skyflo.ai
helm install skyflo skyflo/skyflo
```