# Investigation Protocol — Causal Chain Root Cause Analysis

A protocol for AI agents performing diagnostic investigations on VMware infrastructure (alarms, performance regressions, availability incidents). Adapted from Enterprise Harness Engineering, drawing on 5 Whys, Google SRE, ITIL, and NASA Fault Tree Analysis.

## When to Apply

Use this protocol whenever the user asks:

- "Why is X slow / failing / down?"
- "What caused this alarm / alert / incident?"
- "Investigate / diagnose / debug …"
- Any open-ended question that requires identifying a root cause rather than just reading state.

Do NOT apply for:

- Simple state lookups ("is the VM on?", "list datastores")
- Operational requests ("clone this VM", "create this rule")
- Configuration questions ("what's the default for X?")

## The Four Criteria for Root Cause Completeness

A diagnostic conclusion is **incomplete** unless ALL four criteria are satisfied. The agent must self-check against each one before outputting a report.

### 1. Falsifiability

The root cause must be independently measurable and verifiable. If you cannot test it, it is a hypothesis, not a root cause.

- ✅ "Datastore latency exceeded 50ms because IOPS hit the SAN cap of 10,000" — directly testable via `get_metrics datastore.iops`
- ❌ "Network was congested" — too vague to verify

### 2. Sufficiency

Removing the root cause must make the symptom disappear. If the symptom persists after the supposed fix, the cause was wrong or partial.

- ✅ "Deleting the orphaned snapshot freed 200 GB and the alarm cleared within 60 seconds"
- ❌ "Restarted the VM and the issue went away" — correlation, not causation

### 3. Necessity

The symptom must occur whenever the root cause is present. If the same condition exists elsewhere without the symptom, you have not found the true root cause.

- ✅ "Every cluster with 80%+ memory overcommit shows the same vMotion stall"
- ❌ "Only this one VM has the issue" — without explaining why this VM specifically
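
The peer comparison behind this check can be sketched in a few lines of Python. The `Peer` record and the `necessity_counterexamples` helper are hypothetical names used only for illustration; they are not part of any skill's API. The sketch assumes the agent has already established, for each peer, whether the suspected condition and the symptom are present.

```python
from typing import NamedTuple


class Peer(NamedTuple):
    """One comparable object (cluster, host, VM). Hypothetical record type."""
    name: str
    has_condition: bool  # the suspected root cause is present on this peer
    has_symptom: bool    # the observed symptom is present on this peer


def necessity_counterexamples(peers: list[Peer]) -> list[str]:
    """Return peers that carry the condition but do not show the symptom.

    Any name returned here means the Necessity criterion, as defined above,
    is not satisfied yet.
    """
    return [p.name for p in peers if p.has_condition and not p.has_symptom]


peers = [
    Peer("cluster-a", has_condition=True, has_symptom=True),
    Peer("cluster-b", has_condition=False, has_symptom=False),
    Peer("cluster-c", has_condition=True, has_symptom=False),
]
print(necessity_counterexamples(peers))  # ['cluster-c']
```

An empty result does not prove the root cause on its own; it only means no peer in the sample contradicts it.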

### 4. Mechanism

You must explain the propagation chain: root cause → propagation → amplification → impact. A single point claim with no mechanism is a guess.

- ✅ "Snapshot delta files filled the datastore (root) → VM I/O blocked on write (propagation) → guest filesystem went read-only (amplification) → application timeout (impact)"
- ❌ "The datastore was full" — describes a state, not a chain
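
One way an agent might run the self-check against the four criteria is to keep evidence keyed by criterion and refuse to call a diagnosis complete while any key is empty. This is a minimal sketch; `Criterion`, `Diagnosis`, `unmet`, and `is_complete` are illustrative names, not an existing interface.

```python
from dataclasses import dataclass, field
from enum import Enum


class Criterion(Enum):
    FALSIFIABILITY = "Falsifiability"
    SUFFICIENCY = "Sufficiency"
    NECESSITY = "Necessity"
    MECHANISM = "Mechanism"


@dataclass
class Diagnosis:
    root_cause: str
    # Maps each criterion to the evidence collected for it (metric, log, comparison).
    evidence: dict[Criterion, str] = field(default_factory=dict)

    def unmet(self) -> list[Criterion]:
        """Criteria with missing or blank evidence, in protocol order."""
        return [c for c in Criterion if not self.evidence.get(c, "").strip()]

    def is_complete(self) -> bool:
        """True only when all four criteria have evidence attached."""
        return not self.unmet()


draft = Diagnosis(
    root_cause="Snapshot delta files filled the datastore",
    evidence={Criterion.FALSIFIABILITY: "datastore free space read via metrics"},
)
print([c.value for c in draft.unmet()])  # ['Sufficiency', 'Necessity', 'Mechanism']
```

The sketch only checks that evidence exists; judging whether the evidence is actually convincing remains the agent's job.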

## Investigation Workflow — Up to Three Depth Rounds

### Round 1 — Initial Hypothesis

1. Gather symptoms via L1/L2 read tools (alarms, metrics, events, logs)
2. Form an initial causal chain hypothesis
3. Apply the four criteria

If all four pass → output report.
If any criterion fails → proceed to Round 2 with that criterion as the focus.

### Round 2 — Targeted Deepening

1. Identify which criterion failed
2. Gather additional evidence aimed specifically at that criterion (e.g. failed Necessity → compare against unaffected peers; failed Mechanism → trace next propagation step)
3. Refine the causal chain
4. Re-apply the four criteria

If all four pass → output report.
If any still fails → proceed to Round 3.

### Round 3 — Final Deepen or Escalate

1. If a deeper cause is reachable, gather final evidence and finalize the chain
2. If evidence is unavailable, system-bounded, or beyond the agent's tool surface, **escalate to a human** and explicitly label the conclusion as `⚠️ INCOMPLETE — <criterion> unsatisfied`
3. **Never** silently output a partial conclusion as if it were complete
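
The three rounds above can be read as a bounded loop. The sketch below is one possible shape, assuming a caller-supplied `evaluate` callback (hypothetical) that gathers evidence for the criteria currently in focus and returns whichever criteria are still unmet. The status strings and the returned dictionary layout are also assumptions; only the label text is taken from the protocol.

```python
from typing import Callable

MAX_ROUNDS = 3  # the protocol allows at most three depth rounds

# evaluate(round_no, focus) gathers evidence aimed at the criteria in `focus`
# (empty in Round 1) and returns the names of the criteria that remain unmet.
Evaluate = Callable[[int, list[str]], list[str]]


def investigate(evaluate: Evaluate) -> dict:
    focus: list[str] = []
    for round_no in range(1, MAX_ROUNDS + 1):
        focus = evaluate(round_no, focus)
        if not focus:
            # All four criteria pass, so the report may be output as complete.
            return {"status": "COMPLETE", "rounds": round_no, "labels": []}
    # Gaps remain after Round 3: escalate and label each one explicitly.
    labels = [f"⚠️ INCOMPLETE — {criterion} unsatisfied" for criterion in focus]
    return {"status": "ESCALATE", "rounds": MAX_ROUNDS, "labels": labels}


def fake_evaluate(round_no: int, focus: list[str]) -> list[str]:
    """Stand-in evaluator: Necessity fails in Round 1 and passes in Round 2."""
    return ["Necessity"] if round_no == 1 else []


result = investigate(fake_evaluate)
print(result["status"], result["rounds"])  # COMPLETE 2
```

Because the escalation branch always returns labels, a caller cannot receive an unfinished result without the incompleteness being visible.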

## Output Format

Every investigation report must structure findings exactly as:

```
🔴 [ROOT CAUSE]   <falsifiable, mechanism-explained statement>
  → [PROPAGATION] <how the root cause spread to neighboring systems>
    → [AMPLIFICATION] <what made the impact worse, if applicable>
      → [IMPACT]    <observable user / business / SLA effect>

✅ Falsifiability:  <evidence — metric name, log query, command output>
✅ Sufficiency:     <evidence or stated counterfactual>
✅ Necessity:       <evidence or peer comparison>
✅ Mechanism:       <see propagation chain above>
```

If any criterion is unmet, mark it `⚠️ INCOMPLETE — <reason>` and state explicitly what additional evidence would be required to satisfy it.
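
A small renderer can help keep reports in this layout. The following is a sketch under assumptions: the function name `render_report`, its parameters, and the `gaps` mapping are hypothetical, and the position of the `⚠️ INCOMPLETE` marker within an unmet line is a guess, since the template does not pin it down.

```python
from typing import Optional

CRITERIA = ("Falsifiability", "Sufficiency", "Necessity", "Mechanism")


def render_report(root: str, propagation: str, impact: str,
                  evidence: dict[str, str],
                  amplification: Optional[str] = None,
                  gaps: Optional[dict[str, str]] = None) -> str:
    """Render the causal chain and the four criterion lines as plain text."""
    gaps = gaps or {}
    lines = [f"🔴 [ROOT CAUSE]   {root}",
             f"  → [PROPAGATION] {propagation}"]
    if amplification:  # the amplification step is optional in the template
        lines.append(f"    → [AMPLIFICATION] {amplification}")
    lines.append(f"      → [IMPACT]    {impact}")
    lines.append("")
    for name in CRITERIA:
        text = evidence.get(name, "").strip()
        if text:
            label = f"{name}:".ljust(17)  # pad so the evidence column lines up
            lines.append(f"✅ {label}{text}")
        else:
            reason = gaps.get(name, "no evidence recorded")
            lines.append(f"⚠️ INCOMPLETE — {name}: {reason}")
    return "\n".join(lines)


print(render_report(
    root="Snapshot delta files filled the datastore",
    propagation="VM I/O blocked on write",
    impact="Application timeout",
    evidence={"Falsifiability": "datastore capacity metric",
              "Mechanism": "see propagation chain above"},
    gaps={"Necessity": "compare against datastores with similar snapshot growth"},
))
```

Criteria with no entry in either mapping fall through to a generic reason, so a missing check is never rendered as satisfied.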

## Anti-Patterns

| ❌ Pattern | Why it fails |
|---|---|
| "thanos-cn unreachable" alone | Describes symptom; does not answer **why** unreachable |
| "Datastore full" alone | No propagation, no impact chain |
| "Try restarting it" | Skips diagnosis entirely |
| "Probably the network" | Not falsifiable |
| Stopping at the first plausible cause | Skips Necessity check |
| Silent partial conclusion | Hides incompleteness from the user |

## Worked Examples

### Bad — Incomplete Diagnosis

> "VM is slow because the host is busy."

Missing:
- **Falsifiability**: which metric, what threshold?
- **Necessity**: why this VM only?
- **Mechanism**: how does host load translate into VM slowness?

### Good — Complete Diagnosis

> 🔴 [ROOT CAUSE] Host `esx-03` CPU ready time exceeds 15% (validated via `get_metrics host.cpu.ready`)
>   → [PROPAGATION] vCPU contention from 4-VM reservation collision in resource pool `prod-rp`
>     → [AMPLIFICATION] DRS is in manual mode, so VMs are not rebalanced
>       → [IMPACT] Application p99 latency doubled from 200 ms to 400 ms
>
> ✅ Falsifiability: `host.cpu.ready` metric directly observable; threshold defined in vSphere docs
> ✅ Sufficiency: migrating `vm-A` off `esx-03` with vMotion reduced ready time to 3% and returned p99 latency to 200 ms
> ✅ Necessity: only VMs in `prod-rp` with active reservations are affected; identical workloads in `staging-rp` are healthy
> ✅ Mechanism: cpu.ready = vCPU waiting for pCPU → guest perceives this as CPU starvation → app threadpool exhaustion → tail latency

## Related Skills

A complete investigation often chains across skills:

- **vmware-monitor** (this skill): inventory, alarms, events — code-level read-only data source
- [vmware-aria](https://github.com/zw008/VMware-Aria): metrics, alerts, anomaly detection — primary L1/L2 data source for time-series analysis
- [vmware-aiops](https://github.com/zw008/VMware-AIops): VM/host state, deployment history; can also remediate at L3+ once the investigation is complete and approved
- [vmware-pilot](https://github.com/zw008/VMware-Pilot): orchestrate the investigation itself as a multi-step Dispatcher → Subagent workflow

The agent should treat investigation as **read-heavy first**: gather across skills, reason centrally, only invoke L3+ write tools after the four criteria are satisfied AND the user has approved a remediation plan.
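
The read-heavy-first rule amounts to a gate in front of write tools. The sketch below assumes tool levels can be represented as integers, with levels 1 and 2 standing for the read tools and 3 and above for write tools; that encoding and the `tool_allowed` helper are illustrative assumptions, not part of any skill.

```python
def tool_allowed(tool_level: int, unmet_criteria: list[str],
                 user_approved: bool) -> bool:
    """Decide whether a tool call may proceed at this point in an investigation."""
    if tool_level < 3:
        # Read tools stay available throughout, since they only gather evidence.
        return True
    # Write tools need a complete diagnosis AND an approved remediation plan.
    return not unmet_criteria and user_approved


print(tool_allowed(2, ["Necessity"], user_approved=False))  # True
print(tool_allowed(3, ["Necessity"], user_approved=True))   # False
print(tool_allowed(3, [], user_approved=True))              # True
```

Keeping both conditions in one place makes it harder for a remediation step to run on approval alone while the diagnosis is still incomplete.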