
Fixing a Latency Spike in payment-service: A Real Skyflo Walkthrough

A full incident walkthrough showing Plan-Execute-Verify in action — from natural language query to root cause identification, approved fix, and verified resolution.

10 min read
walkthrough, kubernetes, troubleshooting, incident-response, plan-execute-verify

The Scenario

It's 2:47 PM on a Tuesday. Your monitoring fires an alert: p99 latency for payment-service has jumped from 120ms to 2,400ms in the last 15 minutes. The Slack channel lights up. Someone mentions a deployment went out 20 minutes ago.

You have two options: start opening tabs (Grafana, kubectl terminal, the deployment history in ArgoCD, the recent PR on GitHub), or type one sentence into Skyflo:

"payment-service latency has spiked in the last 15 minutes. Can you investigate?"

This walkthrough follows the full incident lifecycle through Skyflo's Plan → Execute → Verify loop.


Phase 1: Investigation

The agent receives your request and starts with resource discovery. It doesn't guess; it queries your cluster systematically using read-only MCP tools. No approvals needed for reads.

Step 1: Identify the workload and current state.

The agent calls kubernetes.list_pods:

code
$ kubectl get pods -n production -l app=payment-service
NAME                              READY   STATUS             RESTARTS   AGE
payment-service-6b8f9d4c7-h2k4l   1/1     Running            0          18m
payment-service-6b8f9d4c7-m9x2p   1/1     Running            0          18m
payment-service-6b8f9d4c7-q7j8r   1/1     Running            3          18m
payment-service-6b8f9d4c7-v5n1s   0/1     CrashLoopBackOff   5          18m

Four pods. One in CrashLoopBackOff, one with 3 restarts. All deployed 18 minutes ago, which matches the timeline of the reported deployment.

Step 2: Check pod events and logs.

The agent calls kubernetes.get_events and kubernetes.get_pod_logs for the failing pod:

code
$ kubectl describe pod payment-service-6b8f9d4c7-v5n1s -n production
...
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  18m                default-scheduler  Successfully assigned production/payment-service-6b8f9d4c7-v5n1s
  Normal   Pulled     18m                kubelet            Container image "registry.internal/payment-service:v2.14.0" pulled
  Normal   Started    18m                kubelet            Started container payment-service
  Warning  OOMKilled  16m                kubelet            Container payment-service exceeded memory limit
  Normal   Pulled     16m (x5 over 18m)  kubelet            Container image pulled
  Warning  BackOff    2m (x14 over 16m)  kubelet            Back-off restarting failed container

OOMKilled. The container is exceeding its memory limit.
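
If you want to cross-check this by hand, the container's last termination reason is also recorded in the pod status. A quick manual sketch, using the pod name from the listing above:

code
$ kubectl get pod payment-service-6b8f9d4c7-v5n1s -n production \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
OOMKilled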

Step 3: Check resource configuration.

The agent calls kubernetes.get_resource to inspect the deployment spec:

code
$ kubectl get deployment payment-service -n production -o jsonpath='{.spec.template.spec.containers[0].resources}'
{
  "requests": { "cpu": "100m", "memory": "128Mi" },
  "limits":   { "cpu": "250m", "memory": "256Mi" }
}

Memory limit is 256Mi. That was fine for v2.13.x. Let's see what v2.14.0 actually needs.

Step 4: Check actual resource usage.

The agent calls kubernetes.top_pods:

code
$ kubectl top pods -n production -l app=payment-service
NAME                              CPU(cores)   MEMORY(bytes)
payment-service-6b8f9d4c7-h2k4l  89m          241Mi
payment-service-6b8f9d4c7-m9x2p  94m          238Mi
payment-service-6b8f9d4c7-q7j8r  102m         251Mi
payment-service-6b8f9d4c7-v5n1s  0m           0Mi     (CrashLoopBackOff)

The running pods are sitting at 238–251Mi, right up against the 256Mi limit. The crashing pod is likely spiking past that limit while processing requests.

Step 5: Check the previous deployment's resource usage for comparison.

The agent calls kubernetes.get_rollout_history:

code
$ kubectl rollout history deployment/payment-service -n production
REVISION  CHANGE-CAUSE
3         v2.13.2 - routine patch
4         v2.14.0 - added payment reconciliation batch job

Revision 4 added a batch reconciliation job. That explains the memory increase: the new code path is loading reconciliation data into memory.
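
To see exactly what a revision contains, you can pull its full pod template by hand (the manual counterpart of what the agent reads from the rollout history):

code
$ kubectl rollout history deployment/payment-service -n production --revision=4

This prints the pod template for that revision, including the image tag and resource settings.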


The Agent's Assessment

After collecting all this evidence, the agent produces a structured assessment:

Root Cause: Deployment v2.14.0 introduced a payment reconciliation feature that increases memory consumption beyond the current 256Mi limit. Pods are running at 93–98% of the memory limit under normal load, with spikes causing OOMKill.

Impact: 1 of 4 pods is in CrashLoopBackOff. Remaining pods are operating near their memory limit, causing garbage collection pressure and increased latency (120ms → 2,400ms p99).

Recommended Action: Increase memory limit from 256Mi to 512Mi and memory request from 128Mi to 256Mi. This provides headroom for the reconciliation feature while maintaining a 2:1 limit-to-request ratio.

Alternative: Roll back to v2.13.2 (revision 3) to restore service immediately, then address resource limits before redeploying v2.14.0.
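
For reference, that rollback path maps to a single command (shown here as a manual illustration; in Skyflo it would go through the same approval gate as any other mutation):

code
$ kubectl rollout undo deployment/payment-service -n production --to-revision=3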

The agent presents both options to you. This is the first decision point: the agent proposes, the human decides.


Phase 2: Applying the Fix

You choose the first option: increase the memory limits. Rolling back would fix the immediate issue but block the feature release; raising the limits addresses the actual constraint.

The agent prepares the mutation. Before execution, it presents the exact change for approval:

Approval Request:

code
Action: Patch deployment/payment-service in namespace production
Change:
  containers[0].resources.requests.memory: 128Mi → 256Mi
  containers[0].resources.limits.memory: 256Mi → 512Mi

This will trigger a rolling update of all pods.

You review the patch. The namespace is correct (production). The resource is correct (payment-service deployment, not a similarly named one in another namespace). The values are reasonable. You approve.

The agent executes the MCP tool call:

code
Tool: kubernetes.patch_resource
Parameters:
  resource_type: deployment
  name: payment-service
  namespace: production
  patch:
    spec:
      template:
        spec:
          containers:
            - name: payment-service
              resources:
                requests:
                  memory: "256Mi"
                limits:
                  memory: "512Mi"

Execution result:

code
$ kubectl get deployment payment-service -n production
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
payment-service   3/4     4            3           45m

$ kubectl rollout status deployment/payment-service -n production
Waiting for deployment "payment-service" rollout to finish: 2 of 4 updated replicas are available...
Waiting for deployment "payment-service" rollout to finish: 3 of 4 updated replicas are available...
deployment "payment-service" successfully rolled out

The rolling update is progressing. Old pods are being replaced with new ones using the updated resource limits.
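
For comparison, applying the same change by hand would be a strategic merge patch along these lines (a sketch; kubectl accepts the patch body as YAML or JSON):

code
$ kubectl patch deployment payment-service -n production --type=strategic -p '
spec:
  template:
    spec:
      containers:
        - name: payment-service
          resources:
            requests:
              memory: "256Mi"
            limits:
              memory: "512Mi"
'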


Phase 3: Verification

Once execution completes, the agent moves to verification. It doesn't trust "rollout succeeded." It checks actual system state.

Verification Step 1: Pod health.

code
$ kubectl get pods -n production -l app=payment-service
NAME                              READY   STATUS    RESTARTS   AGE
payment-service-8c5a1e3f9-a4b7d  1/1     Running   0          2m
payment-service-8c5a1e3f9-c8e2f  1/1     Running   0          2m
payment-service-8c5a1e3f9-g6h9j  1/1     Running   0          1m
payment-service-8c5a1e3f9-k3l5m  1/1     Running   0          1m

All 4 pods running, 0 restarts. No CrashLoopBackOff.

Verification Step 2: Resource usage against new limits.

code
$ kubectl top pods -n production -l app=payment-service
NAME                              CPU(cores)   MEMORY(bytes)
payment-service-8c5a1e3f9-a4b7d  78m          243Mi
payment-service-8c5a1e3f9-c8e2f  82m          239Mi
payment-service-8c5a1e3f9-g6h9j  75m          245Mi
payment-service-8c5a1e3f9-k3l5m  80m          241Mi

Memory usage is 239–245Mi against a 512Mi limit. That's 47–48% utilization. Healthy headroom.

Verification Step 3: Check that the original problem is resolved.

The agent checks for recent OOMKill events:

code
$ kubectl get events -n production --field-selector reason=OOMKilling --sort-by='.lastTimestamp'
No resources found in production namespace.

No new OOMKill events since the fix.

Verification Step 4: Latency check.

If Prometheus is configured as an MCP tool, the agent queries it:

code
PromQL: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="payment-service"}[5m]))
Result: 0.135 (135ms)

p99 latency is back to 135ms, within normal range. The spike is resolved.


The Verification Report

Verification: PASSED

- All 4 pods healthy, 0 restarts, 0 OOMKill events post-fix

- Memory utilization at 47% of new limit (healthy headroom)

- p99 latency recovered: 2,400ms → 135ms

- No cascading effects detected in dependent services

Root cause: Deployment v2.14.0 increased memory consumption beyond the 256Mi limit due to the payment reconciliation feature. Memory limit increased to 512Mi.

Recommendation: Update the Helm chart defaults for payment-service to reflect the new memory requirements before the next release.
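
That follow-up might look something like this in the chart's values file (illustrative; the actual key names depend on how the payment-service chart is structured):

code
# values.yaml for payment-service (hypothetical keys)
resources:
  requests:
    cpu: 100m
    memory: 256Mi   # was 128Mi
  limits:
    cpu: 250m
    memory: 512Mi   # was 256Mi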


What Made This Different From Manual Troubleshooting?

Let's map the same incident to the manual approach:

| Step | Manual Process | With Skyflo |
| --- | --- | --- |
| Identify pods | Open terminal, kubectl get pods, scan output | Automatic resource discovery |
| Check events | kubectl describe pod, scroll through events | Targeted event retrieval, OOMKill flagged |
| Check resources | Multiple kubectl get commands with jsonpath | Structured resource extraction |
| Check metrics | Switch to Grafana, find the dashboard, set timerange | MCP tool call to Prometheus, correlated with pod data |
| Correlate | Mental model across 4+ tabs and terminal windows | Agent correlates all signals in one context |
| Propose fix | Draft the patch, double-check namespace and values | Structured patch with clear diff |
| Apply | kubectl patch or kubectl edit, hope for no typo | Typed tool call with schema validation + approval gate |
| Verify | Re-check pods, metrics, events (if you remember to) | Automatic multi-step verification |

Total manual time: 30–45 minutes (more if you're context-switching during an incident). Total Skyflo time: 5–8 minutes, most of which is the rolling update itself.

The speed improvement isn't from AI being faster at typing kubectl. It's from eliminating context-switching, automating signal correlation, and structuring the verification step that humans often skip under pressure.


Key Takeaways

  1. The agent doesn't guess. It queries your cluster systematically before proposing anything. Every recommendation is grounded in actual resource state, not pattern-matched from training data.
  2. The approval gate shows you exactly what will change. Not "I'll fix the memory issue," but the specific patch, the target resource, and the namespace. You approve the action, not the intent.
  3. Verification is automatic and evidence-based. The agent doesn't ask you if it worked. It checks pod health, resource utilization, error events, and application metrics. It reports evidence, not opinions.
  4. The loop handles complexity. If the agent had found that pods were still OOMKilling after the increase, it would have looped back to propose a larger increase or a rollback. The agent adapts; it doesn't blindly follow a script.

For more operational scenarios, see our use cases page. To understand the architecture behind this walkthrough, read How Skyflo Works Under the Hood.


Try Skyflo

Skyflo is open-source and self-hosted. Run your first diagnostic in minutes.

bash
helm repo add skyflo https://charts.skyflo.ai
helm install skyflo skyflo/skyflo