Why MTTR Is Still Measured in Hours
The Kubernetes ecosystem has never had more observability. Prometheus, Grafana, Datadog, PagerDuty, Jaeger, Loki, the Kubernetes dashboard. The average SRE team has 6+ tools for understanding what's happening in their cluster.
And yet, MTTR hasn't improved proportionally. DORA research consistently shows that even "elite" teams average incident resolution times that surprise people given the tooling available.
The problem isn't any individual tool. The problem is the space between tools.
The Anatomy of a 45-Minute Incident
Let's trace a typical Kubernetes incident from alert to resolution:
Minute 0: Alert fires.
PagerDuty pings you: "payment-service p99 latency > 2s." You're on-call. You open your laptop.
Minutes 1-5: Context acquisition.
You open Grafana. Find the payment-service dashboard. Set the time range to the last 30 minutes. You see the latency spike. But the Grafana dashboard doesn't show pod-level details. You open a terminal.
kubectl get pods -n production -l app=payment-service
Four pods. One is in CrashLoopBackOff. You need events and logs, so you stay in the terminal.
Minutes 5-12: Signal correlation.
You run kubectl describe pod on the failing pod. OOMKilled. You check resource limits. You open another terminal tab to check kubectl top pods. You switch to Grafana to check the memory usage graph over time. You wonder when the last deployment was. You open ArgoCD or your deployment history tool. Deployment 20 minutes ago: v2.14.0.
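In terminal terms, that correlation step is a handful of reads, something like the following (the pod name is the one from this incident; yours will differ):
# container state, restart count, and events for the failing pod (shows OOMKilled)
kubectl describe pod payment-service-v5n1s -n production
# live memory usage across the payment-service pods
kubectl top pods -n production -l app=payment-service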
You're now in 4 different tools, 3 browser tabs, and 2 terminal windows. You're building a mental model of what happened by manually correlating timestamps across disconnected interfaces.
Minutes 12-20: Root cause identification.
You've pieced it together: the new version uses more memory than the limits allow. But you want to be sure. You check the git diff for v2.14.0. What changed? You open GitHub, find the PR, skim the changes. There's a new batch reconciliation feature. That's likely the memory increase.
Minutes 20-30: Remediation planning.
Now you need to decide: roll back or increase limits? You check the current usage against the proposed new limits. You draft a kubectl patch command. You double-check the namespace. You double-check the resource name. You show the command to a colleague because you don't want to be the person who patched the wrong thing in production.
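The draft itself is usually a one-liner like this (a sketch, assuming the container inside the deployment is also named payment-service and you settle on doubling the limit):
# raise memory requests and limits on the payment-service container
kubectl patch deployment payment-service -n production --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"payment-service","resources":{"requests":{"memory":"256Mi"},"limits":{"memory":"512Mi"}}}]}}}}'
It's small and mechanical, and it's exactly the kind of command where a wrong namespace or a typo in the container name turns a fix into a second incident.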
Minutes 30-38: Execution.
You run the patch. Wait for the rolling update. Watch the pods restart. Check that the new pods aren't OOMKilling.
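In practice, "watch the pods restart" means polling:
# block until the rolling update finishes (or times out)
kubectl rollout status deployment/payment-service -n production
# confirm the new pods stay Running instead of cycling back to CrashLoopBackOff
kubectl get pods -n production -l app=payment-service -w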
Minutes 38-45: Verification.
You go back to Grafana. Wait for metrics to update. Check that latency is recovering. Check for new error logs. Update the incident channel: "Fix deployed, monitoring."
Total: 45 minutes. Of which maybe 5 minutes was executing the fix. The other 40 was context acquisition, tool switching, signal correlation, and verification.
Where the Time Actually Goes
Breaking down the 45 minutes by category:
| Phase | Time Spent | Why It Takes So Long |
|---|---|---|
| Context acquisition | 5-8 min | Opening tools, finding dashboards, setting time ranges, running initial queries |
| Signal correlation | 8-12 min | Cross-referencing data from kubectl, Grafana, deployment history, logs, all in different interfaces |
| Root cause identification | 5-10 min | Synthesizing correlated signals into a causal explanation |
| Remediation planning | 5-8 min | Deciding on approach, drafting commands, peer review |
| Execution | 3-5 min | Running the actual fix |
| Verification | 5-8 min | Confirming the fix worked, checking for side effects |
The execution step (the actual fix) is the shortest phase. Everything else is cognitive overhead: gathering information, building a mental model, deciding what to do, and confirming it worked.
This is where AI agents fundamentally change the equation. Not by typing kubectl faster, but by collapsing the phases that surround execution.
How AI Agents Reduce Each Phase
Phase 1: Detection — Faster Signal Correlation
Traditional flow: Alert fires → human opens 3 tools → human manually correlates signals.
Agent flow: Alert fires (or human describes the symptom) → agent queries all relevant data sources in parallel → agent presents correlated findings in a single view.
When you tell Skyflo "payment-service latency has spiked," the agent executes a diagnostic sequence in seconds:
- kubernetes.list_pods: finds the CrashLoopBackOff pod
- kubernetes.get_events: finds OOMKilled events
- kubernetes.top_pods: finds memory at 94% of limit
- kubernetes.get_rollout_history: finds the recent deployment
All of these are read-only MCP tool calls. No approvals needed. The agent executes them in rapid succession and presents the correlated findings. What took a human 12 minutes of tab-switching takes the agent about 15 seconds.
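Conceptually, those tool calls map onto the same reads an operator would run by hand. A rough kubectl equivalent (not the tools' actual implementation) looks like this:
kubectl get pods -n production -l app=payment-service              # kubernetes.list_pods
kubectl get events -n production --sort-by=.lastTimestamp          # kubernetes.get_events
kubectl top pods -n production -l app=payment-service              # kubernetes.top_pods
kubectl rollout history deployment/payment-service -n production   # kubernetes.get_rollout_history
The difference isn't the commands; it's that nobody has to remember to run them, in order, while paging through dashboards.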
The time savings come from two sources:
- No context switching. The agent doesn't need to "open Grafana" or "find the dashboard." It queries the data source directly.
- Parallel information gathering. The agent can query multiple tools simultaneously. A human processes information sequentially: check pods, then events, then metrics, then deployment history. The agent collects all of it in one pass.
Phase 2: Diagnosis — Automatic Root Cause Identification
Traditional flow: Human stares at correlated data → builds mental model → forms hypothesis → tests hypothesis manually.
Agent flow: Agent receives correlated data → identifies causal chain → presents root cause with evidence.
The agent doesn't just present raw data. It synthesizes:
"Pod
payment-service-v5n1sis OOMKilled. Memory limit is 256Mi. Actual usage on running pods is 238-251Mi (93-98% of limit). Deployment v2.14.0 was deployed 20 minutes ago. Revision changelog indicates a new batch reconciliation feature. This feature likely increased baseline memory consumption beyond the configured limit."
This synthesis is what takes a human 5-10 minutes of staring at disparate data and "connecting the dots." The agent does it because it has all the data in a single context window (pod status, events, metrics, deployment history) and can identify the causal chain.
Is the agent always right? No. But it's right often enough that the human can quickly confirm or redirect: "That matches what I see. Let's increase the limits." Or: "That's not it — check the config maps too."
Phase 3: Remediation — Safe Execution with Approval
Traditional flow: Human drafts command → peer reviews → human executes → human watches rollout.
Agent flow: Agent proposes specific fix with evidence → human reviews in context → agent executes via typed tool → agent watches rollout automatically.
The agent presents a specific, evidence-based proposal:
Patch deployment/payment-service in production:
resources.limits.memory: 256Mi → 512Mi
resources.requests.memory: 128Mi → 256Mi
Evidence: Current usage 238-251Mi. OOMKill at 256Mi limit.
512Mi provides ~2x headroom based on observed usage.
The operator reviews and approves. The agent executes the typed tool call (kubernetes.patch_resource) and monitors the rollout. No drafting commands. No double-checking namespace syntax. No copy-paste errors.
Phase 4: Verification — Automated Post-Fix Validation
Traditional flow: Human goes back to Grafana → waits for metrics update → manually checks pod health → updates incident channel.
Agent flow: The agent automatically checks pod health, resource utilization, error events, and application metrics, then reports evidence-based verification.
The agent doesn't ask "did it work?" It checks:
- Are all pods running and ready?
- Is memory utilization within the new limits?
- Are there new OOMKill events?
- Has p99 latency recovered?
This is the phase humans most often skip or abbreviate under time pressure. "The pods are running, ship it, I'll check Grafana later." The agent does it automatically because it's a structured step in the Plan → Execute → Verify workflow, not a manual afterthought.
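Done by hand, that checklist is a few more reads plus a dashboard visit (illustrative commands; the agent performs the equivalent checks through its read-only tools):
# all replicas updated, available, and ready after the patch
kubectl rollout status deployment/payment-service -n production
# memory now comfortably under the new 512Mi limit
kubectl top pods -n production -l app=payment-service
# no fresh OOMKill or restart events since the rollout
kubectl get events -n production --sort-by=.lastTimestamp
The p99 latency check still comes from your metrics stack (Prometheus or Grafana in most setups).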
The 5-Minute Incident
Same incident, with Skyflo:
| Time | What Happens |
|---|---|
| 0:00 | Alert fires. You type: "payment-service latency spiked, investigate." |
| 0:15 | Agent queries pods, events, metrics, deployment history. Presents correlated findings. |
| 0:45 | Agent identifies root cause: OOMKill due to memory limits too low for v2.14.0. |
| 1:00 | Agent proposes fix: increase memory limits from 256Mi to 512Mi. Shows the specific patch. |
| 1:30 | You review the proposal. Approve. |
| 1:45 | Agent executes the patch via typed MCP tool call. Rolling update starts. |
| 3:30 | Rolling update completes. Agent checks pod health, resource utilization, latency. |
| 4:00 | Agent reports: all pods healthy, memory at 47%, latency at 135ms, no OOMKill events. |
| 4:30 | Done. Incident resolved with full evidence trail. |
Total: under 5 minutes. The 40 minutes of context acquisition, signal correlation, and manual verification collapsed to seconds. The human's involvement was focused on the highest-value moment: reviewing and approving the proposed fix.
What Enables the Speed
The speed improvement isn't from AI being faster at infrastructure commands. It comes from structural changes to the incident response workflow:
Unified context. Instead of building a mental model across 6 tools, the agent builds a complete picture in a single context window. Every tool's output feeds into the same reasoning process.
Evidence-based proposals. The agent doesn't say "try increasing memory." It says "increase memory from 256Mi to 512Mi because current usage is 241Mi and OOMKills started after v2.14.0." The specificity eliminates the back-and-forth of "what should I set it to?"
Automated verification. The step that humans skip under pressure is the step the agent does automatically. This means incidents don't get marked as "resolved" based on gut feeling; they're resolved based on evidence.
Continuous context. If the fix doesn't work, the agent doesn't start over. It has the full context of what was tried, what the result was, and what changed. It continues investigating: "Memory limits were increased but pods are still crashing. New evidence: exit code 137, kill signal from cgroup. Checking node-level memory pressure."
Measuring the Impact
For teams tracking DORA metrics, AI-assisted incident response impacts two metrics directly:
| Metric | Before AI Agent | With AI Agent | Why |
|---|---|---|---|
| MTTR | 30-60 min average | 5-10 min average | Context acquisition and correlation automated |
| Change Failure Rate | Higher (manual operations error-prone) | Lower (typed tools + approval gate + verification) | Structured execution replaces ad-hoc commands |
The secondary effects matter too:
- Reduced toil. The diagnostic steps that SREs repeat for every incident (check pods, check events, check logs, check metrics) are automated. SREs focus on judgment calls, not data gathering.
- Knowledge capture. Every incident response is recorded as a structured conversation: what was asked, what was discovered, what was proposed, what was approved, what was verified. This is institutional knowledge that new team members can learn from.
- Reduced cognitive load during incidents. The agent handles the "what should I check next?" question, which is the hardest question to answer when you're stressed, tired, and paged at 3 AM.
For teams evaluating this approach, the use cases page shows specific operational scenarios. To see the agent architecture that enables this, read How Skyflo Works Under the Hood.
Try Skyflo
Reduce your MTTR from hours to minutes. Open-source, self-hosted, your data never leaves your infrastructure.
helm repo add skyflo https://charts.skyflo.ai
helm install skyflo skyflo/skyflo