The Scenario
It's 2:47 PM on a Tuesday. Your monitoring fires an alert: p99 latency for payment-service has jumped from 120ms to 2,400ms in the last 15 minutes. The Slack channel lights up. Someone mentions a deployment went out 20 minutes ago.
You have two options: start opening tabs (Grafana, kubectl terminal, the deployment history in ArgoCD, the recent PR on GitHub), or type one sentence into Skyflo:
"payment-service latency has spiked in the last 15 minutes. Can you investigate?"
This walkthrough follows the full incident lifecycle through Skyflo's Plan → Execute → Verify loop.
Phase 1: Investigation
The agent receives your request and starts with resource discovery. It doesn't guess; it queries your cluster systematically using read-only MCP tools. No approvals needed for reads.
Step 1: Identify the workload and current state.
The agent calls kubernetes.list_pods:
$ kubectl get pods -n production -l app=payment-service
NAME READY STATUS RESTARTS AGE
payment-service-6b8f9d4c7-h2k4l 1/1 Running 0 18m
payment-service-6b8f9d4c7-m9x2p 1/1 Running 0 18m
payment-service-6b8f9d4c7-q7j8r 1/1 Running 3 18m
payment-service-6b8f9d4c7-v5n1s 0/1 CrashLoopBackOff 5 18m
Four pods. One in CrashLoopBackOff, one with 3 restarts. All deployed 18 minutes ago, which matches the timeline of the reported deployment.
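Under the hood, that discovery step is a typed MCP tool call rather than a raw kubectl invocation. A rough sketch of its shape, with illustrative parameter names rather than a guaranteed schema:
Tool: kubernetes.list_pods
Parameters:
  namespace: production
  label_selector: app=payment-service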
Step 2: Check pod events and logs.
The agent calls kubernetes.get_events and kubernetes.get_pod_logs for the failing pod:
$ kubectl describe pod payment-service-6b8f9d4c7-v5n1s -n production
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned production/payment-service-6b8f9d4c7-v5n1s
Normal Pulled 18m kubelet Container image "registry.internal/payment-service:v2.14.0" pulled
Normal Started 18m kubelet Started container payment-service
Warning OOMKilled 16m kubelet Container payment-service exceeded memory limit
Normal Pulled 16m (x5 over 18m) kubelet Container image pulled
Warning BackOff 2m (x14 over 16m) kubelet Back-off restarting failed container
OOMKilled. The container is exceeding its memory limit.
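The agent also pulls the crashed container's logs via kubernetes.get_pod_logs. Done by hand, the equivalent uses --previous to read the terminated container's output instead of the current restart (command shown for reference; log contents omitted here):
$ kubectl logs payment-service-6b8f9d4c7-v5n1s -n production --previous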
Step 3: Check resource configuration.
The agent calls kubernetes.get_resource to inspect the deployment spec:
$ kubectl get deployment payment-service -n production -o jsonpath='{.spec.template.spec.containers[0].resources}'
{
  "requests": { "cpu": "100m", "memory": "128Mi" },
  "limits": { "cpu": "250m", "memory": "256Mi" }
}
Memory limit is 256Mi. That was fine for v2.13.x. Let's see what v2.14.0 actually needs.
Step 4: Check actual resource usage.
The agent calls kubernetes.top_pods:
$ kubectl top pods -n production -l app=payment-service
NAME CPU(cores) MEMORY(bytes)
payment-service-6b8f9d4c7-h2k4l 89m 241Mi
payment-service-6b8f9d4c7-m9x2p 94m 238Mi
payment-service-6b8f9d4c7-q7j8r 102m 251Mi
payment-service-6b8f9d4c7-v5n1s 0m 0Mi (CrashLoopBackOff)
The running pods are using 238–251Mi, right up against the 256Mi limit. The pod that's OOMKilling is likely spiking past the limit on request processing.
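If you want to confirm the kill reason straight from the container status rather than the event stream, a jsonpath query works too; for an OOM-killed container Kubernetes records the reason as OOMKilled:
$ kubectl get pod payment-service-6b8f9d4c7-v5n1s -n production -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
OOMKilled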
Step 5: Check the previous deployment's resource usage for comparison.
The agent calls kubernetes.get_rollout_history:
$ kubectl rollout history deployment/payment-service -n production
REVISION CHANGE-CAUSE
3 v2.13.2 - routine patch
4 v2.14.0 - added payment reconciliation batch job
Revision 4 added a batch reconciliation job. That explains the memory increase: the new code path is loading reconciliation data into memory.
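To see exactly what the new revision changed at the pod-template level, kubectl can print the template recorded for a specific revision (shown for reference):
$ kubectl rollout history deployment/payment-service -n production --revision=4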
The Agent's Assessment
After collecting all this evidence, the agent produces a structured assessment:
Root Cause: Deployment v2.14.0 introduced a payment reconciliation feature that increases memory consumption beyond the current 256Mi limit. Pods are running at 93–98% of the memory limit under normal load, with spikes causing OOMKill.
Impact: 1 of 4 pods is in CrashLoopBackOff. Remaining pods are operating near their memory limit, causing garbage collection pressure and increased latency (120ms → 2,400ms p99).
Recommended Action: Increase memory limit from 256Mi to 512Mi and memory request from 128Mi to 256Mi. This provides headroom for the reconciliation feature while maintaining a 2:1 limit-to-request ratio.
Alternative: Roll back to v2.13.2 (revision 3) to restore service immediately, then address resource limits before redeploying v2.14.0.
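For reference, the rollback alternative maps to a single standard command; issued through Skyflo, it would pass through the same approval gate as any other mutation:
$ kubectl rollout undo deployment/payment-service -n production --to-revision=3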
The agent presents both options to you. This is the first decision point: the agent proposes, the human decides.
Phase 2: Applying the Fix
You choose the first option: increase the memory limits. Rolling back would fix the immediate issue but block the feature release; raising the limits addresses the actual resource shortfall.
The agent prepares the mutation. Before execution, it presents the exact change for approval:
Approval Request:
Action: Patch deployment/payment-service in namespace production
Change:
containers[0].resources.requests.memory: 128Mi → 256Mi
containers[0].resources.limits.memory: 256Mi → 512Mi
This will trigger a rolling update of all pods.
You review the patch. The namespace is correct (production). The resource is correct (payment-service deployment, not a similarly named one in another namespace). The values are reasonable. You approve.
The agent executes the MCP tool call:
Tool: kubernetes.patch_resource
Parameters:
  resource_type: deployment
  name: payment-service
  namespace: production
  patch:
    spec:
      template:
        spec:
          containers:
          - name: payment-service
            resources:
              requests:
                memory: "256Mi"
              limits:
                memory: "512Mi"
Execution result:
$ kubectl get deployment payment-service -n production
NAME READY UP-TO-DATE AVAILABLE AGE
payment-service 3/4 4 3 45m
$ kubectl rollout status deployment/payment-service -n production
Waiting for deployment "payment-service" rollout to finish: 2 of 4 updated replicas are available...
Waiting for deployment "payment-service" rollout to finish: 3 of 4 updated replicas are available...
deployment "payment-service" successfully rolled outThe rolling update is progressing. Old pods are being replaced with new ones using the updated resource limits.
Phase 3: Verification
Once execution completes, the agent moves to verification. It doesn't trust "rollout succeeded." It checks actual system state.
Verification Step 1: Pod health.
$ kubectl get pods -n production -l app=payment-service
NAME READY STATUS RESTARTS AGE
payment-service-8c5a1e3f9-a4b7d 1/1 Running 0 2m
payment-service-8c5a1e3f9-c8e2f 1/1 Running 0 2m
payment-service-8c5a1e3f9-g6h9j 1/1 Running 0 1m
payment-service-8c5a1e3f9-k3l5m 1/1 Running 0 1m
All 4 pods running, 0 restarts. No CrashLoopBackOff.
Verification Step 2: Resource usage against new limits.
$ kubectl top pods -n production -l app=payment-service
NAME CPU(cores) MEMORY(bytes)
payment-service-8c5a1e3f9-a4b7d 78m 243Mi
payment-service-8c5a1e3f9-c8e2f 82m 239Mi
payment-service-8c5a1e3f9-g6h9j 75m 245Mi
payment-service-8c5a1e3f9-k3l5m 80m 241Mi
Memory usage is 239–245Mi against a 512Mi limit. That's 47–48% utilization. Healthy headroom.
Verification Step 3: Check that the original problem is resolved.
The agent checks for recent OOMKill events:
$ kubectl get events -n production --field-selector reason=OOMKilling --sort-by='.lastTimestamp'
No resources found in production namespace.
No new OOMKill events since the fix.
Verification Step 4: Latency check.
If Prometheus is configured as an MCP tool, the agent queries it:
$ promql: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="payment-service"}[5m]))
Result: 0.135 (135ms)
p99 latency is back to 135ms, within normal range. The spike is resolved.
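To double-check against the pre-incident baseline, the same query can be shifted back in time with PromQL's offset modifier (illustrative; the offset depends on when the incident started):
$ promql: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="payment-service"}[5m] offset 1h))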
The Verification Report
Verification: PASSED
- All 4 pods healthy, 0 restarts, 0 OOMKill events post-fix
- Memory utilization at 47% of new limit (healthy headroom)
- p99 latency recovered: 2,400ms → 135ms
- No cascading effects detected in dependent services
Root cause: Deployment v2.14.0 increased memory consumption beyond the 256Mi limit due to the payment reconciliation feature. Memory limit increased to 512Mi.
Recommendation: Update the Helm chart defaults for payment-service to reflect the new memory requirements before the next release.
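A hypothetical values.yaml fragment for that follow-up, assuming the chart exposes a standard resources block (the exact key path depends on how the payment-service chart is structured):
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 250m
    memory: 512Mi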
What Made This Different From Manual Troubleshooting?
Let's map the same incident to the manual approach:
| Step | Manual Process | With Skyflo |
|---|---|---|
| Identify pods | Open terminal, kubectl get pods, scan output | Automatic resource discovery |
| Check events | kubectl describe pod, scroll through events | Targeted event retrieval, OOMKill flagged |
| Check resources | Multiple kubectl get commands with jsonpath | Structured resource extraction |
| Check metrics | Switch to Grafana, find the dashboard, set timerange | MCP tool call to Prometheus, correlated with pod data |
| Correlate | Mental model across 4+ tabs and terminal windows | Agent correlates all signals in one context |
| Propose fix | Draft the patch, double-check namespace and values | Structured patch with clear diff |
| Apply | kubectl patch or kubectl edit, hope for no typo | Typed tool call with schema validation + approval gate |
| Verify | Re-check pods, metrics, events (if you remember to) | Automatic multi-step verification |
Total manual time: 30–45 minutes (more if you're context-switching during an incident). Total Skyflo time: 5–8 minutes, most of which is the rolling update itself.
The speed improvement isn't from AI being faster at typing kubectl. It's from eliminating context-switching, automating signal correlation, and structuring the verification step that humans often skip under pressure.
Key Takeaways
- The agent doesn't guess. It queries your cluster systematically before proposing anything. Every recommendation is grounded in actual resource state, not pattern-matched from training data.
- The approval gate shows you exactly what will change. Not "I'll fix the memory issue," but the specific patch, the target resource, and the namespace. You approve the action, not the intent.
- Verification is automatic and evidence-based. The agent doesn't ask you if it worked. It checks pod health, resource utilization, error events, and application metrics. It reports evidence, not opinions.
- The loop handles complexity. If the agent had found that pods were still OOMKilling after the increase, it would have looped back to propose a larger increase or a rollback. The agent adapts; it doesn't blindly follow a script.
For more operational scenarios, see our use cases page. To understand the architecture behind this walkthrough, read How Skyflo Works Under the Hood.
Try Skyflo
Skyflo is open-source and self-hosted. Run your first diagnostic in minutes.
helm repo add skyflo https://charts.skyflo.ai
helm install skyflo skyflo/skyflo
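Depending on your environment, you may also want to refresh the repo index first and install into a dedicated namespace (the namespace name here is just an example):
helm repo update
helm install skyflo skyflo/skyflo --namespace skyflo --create-namespace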