Network incidents are expensive in two ways. The obvious cost is downtime. The less obvious cost is engineer time: specifically, the time spent manually correlating signals across multiple monitoring systems, network management platforms, and historical incident databases before a human can even begin to diagnose what went wrong.
For the enterprise I worked with, that investigation phase averaged 20 to 40 minutes per incident. Not resolution time. Investigation time. Before a single remediation action was taken, engineers spent that time just gathering and correlating information that existed in three separate systems that did not talk to each other.
This is exactly the kind of problem AI agents are well suited to solve.
The Three Data Sources
The network environment used three primary systems that each captured a different slice of the operational picture:
- Aruba provided wireless access point state, client association data, and RF environment visibility
- Cisco provided wired switching and routing state, interface counters, and spanning tree topology
- IP Fabric provided a vendor-agnostic logical model of the entire network topology, including path analysis and historical state snapshots
During an incident, an engineer would need to query all three systems, reconcile the data formats, identify which signals were correlated versus coincidental, and form a hypothesis about the root cause. Each system had its own UI and query language. There was no unified view.
The Agent Architecture
The agent is structured in three layers. A data ingestion layer maintains live connections to all three APIs and normalizes their outputs into a unified schema. When an alert fires, the ingestion layer immediately queries all three systems for the relevant segment of the network and returns structured data within seconds.
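A minimal sketch of what normalizing into a unified schema could look like. The `Signal` fields and both payload shapes are assumptions made for illustration. They are not the real Aruba or Cisco API response formats, and the production schema may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Signal:
    """One normalized observation, regardless of which system produced it."""
    source: str            # "aruba", "cisco", or "ipfabric"
    device: str            # device name as known to the topology model
    metric: str            # normalized metric name
    value: object          # raw value as reported by the source
    observed_at: datetime  # timezone-aware observation time


def from_aruba(payload: dict) -> Signal:
    # Hypothetical payload shape: epoch seconds and AP-centric field names.
    return Signal(
        source="aruba",
        device=payload["ap_name"],
        metric=payload["event"],
        value=payload["detail"],
        observed_at=datetime.fromtimestamp(payload["ts"], tz=timezone.utc),
    )


def from_cisco(payload: dict) -> Signal:
    # Hypothetical payload shape: ISO 8601 timestamp and interface counters.
    return Signal(
        source="cisco",
        device=payload["hostname"],
        metric=f'{payload["interface"]}:{payload["counter"]}',
        value=payload["count"],
        observed_at=datetime.fromisoformat(payload["timestamp"]),
    )
```

Once every source emits the same record type, the downstream layers never need to know which vendor a signal came from, only what was observed, where, and when.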
A correlation engine takes that structured data and identifies which signals are temporally and topologically correlated. Not every signal that appears during an incident is causally related to it. The correlation engine filters signal from noise using graph-based analysis of the network topology model from IP Fabric.
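One simple way to express "temporally and topologically correlated" is to keep only signals that fall inside a time window around the alert and within a few hops of the alerting device. The sketch below assumes the topology is available as a plain adjacency dictionary. The 5-minute window and 2-hop limit are illustrative defaults, not values from the production system.

```python
from collections import deque
from datetime import timedelta


def hop_distances(adjacency: dict, start: str) -> dict:
    """Breadth-first search: hop count from start to every reachable device."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def correlate(signals, adjacency, alert_device, alert_time,
              window=timedelta(minutes=5), max_hops=2):
    """Keep (device, observed_at, description) tuples near the alert
    in both time and topology. Unreachable devices are always dropped."""
    distances = hop_distances(adjacency, alert_device)
    kept = []
    for device, observed_at, description in signals:
        close_in_time = abs(observed_at - alert_time) <= window
        close_in_topology = distances.get(device, max_hops + 1) <= max_hops
        if close_in_time and close_in_topology:
            kept.append((device, observed_at, description))
    return kept
```

Proximity in time and topology is only a filter. It narrows the candidate set but does not establish causation, which is why the reasoning step still follows.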
The reasoning agent, powered by Amazon Bedrock with Claude, takes the correlated signal set and reasons about probable root causes. It has access to a knowledge base of historical incident patterns and their resolutions, which it uses to rank hypotheses by prior frequency and evidence match. The output is a structured hypothesis with ranked causes, supporting evidence from each data source, and a recommended remediation path.
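The post does not say how prior frequency and evidence match are combined, so the following is only one plausible scoring scheme: multiply the share of past incidents attributed to a cause by the fraction of its expected signals that were actually observed. The pattern fields (`cause`, `count`, `expected_metrics`) are hypothetical names, and in the real system this ranking is performed by the model with the knowledge base as context rather than by a fixed formula.

```python
def rank_hypotheses(patterns, observed_metrics):
    """Rank historical patterns by prior frequency times evidence match."""
    total = sum(p["count"] for p in patterns)
    if total == 0:
        return []
    ranked = []
    for p in patterns:
        prior = p["count"] / total
        expected = set(p["expected_metrics"])
        matched = expected & set(observed_metrics)
        # Fraction of this pattern's expected signals seen in the incident.
        match = len(matched) / len(expected) if expected else 0.0
        ranked.append({
            "cause": p["cause"],
            "score": prior * match,
            "matched": sorted(matched),
        })
    ranked.sort(key=lambda h: h["score"], reverse=True)
    return ranked
```

A frequent cause with weak evidence can score below a rarer cause whose expected signals are all present, which is the behavior one would want from ranking by both prior and evidence.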
What the First Version Got Wrong
The first version of this agent gave engineers too much information. The hypothesis output included all correlated signals, all candidate root causes, and all potential remediations. Engineers found it overwhelming. They spent as much time reading the agent's output as they had previously spent investigating manually.
The lesson: the value of an AI agent in an operations context is not more information. It is the right information, ranked, with evidence. Engineers are experts. They do not need a wall of text. They need the top 2-3 hypotheses, the evidence for each, and a clear recommended next action.
The second version enforced a strict output format: the top-ranked hypothesis, the three most significant supporting signals, and one recommended remediation step. Engineers could expand to see the full analysis if needed. Most did not need to.
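As a rough illustration of that format, the trimming step can be a small function that reduces the full analysis to exactly one hypothesis, three signals, and one remediation. The structure of `full_analysis` here (a best-first list of hypotheses, each with weighted signals and ordered remediations) is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    hypothesis: str
    signals: tuple      # at most max_signals short descriptions
    remediation: str


def summarize(full_analysis: dict, max_signals: int = 3) -> Summary:
    """Reduce the full analysis to the default view shown to engineers."""
    top = full_analysis["hypotheses"][0]  # assumed sorted best first
    strongest = sorted(top["signals"], key=lambda s: s["weight"], reverse=True)
    return Summary(
        hypothesis=top["cause"],
        signals=tuple(s["description"] for s in strongest[:max_signals]),
        remediation=top["remediations"][0],
    )
```

The full analysis is kept alongside the summary rather than discarded, so the expanded view remains available on request.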
Grounding Is Everything
The most important architectural decision was the emphasis on grounding. The reasoning agent does not generate hypotheses from parametric knowledge about how networks behave in general. It generates hypotheses from the actual state of this specific network, at this specific moment, as observed through the three data sources.
This matters because a hallucinated network topology is worse than no information. If the agent confidently describes a path through a switch that does not exist in this environment, engineers waste time chasing a ghost. Every factual claim in the hypothesis output is tagged with its source: which system it came from, what the raw value was, and when it was observed. Engineers can verify any claim in under ten seconds.
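A sketch of how source tagging might be represented and checked. The `Claim` fields mirror the three tags described above (source, raw value, observation time). The `device` field and the `ungrounded` check against the topology snapshot are additions for illustration, showing one way a reference to a nonexistent switch could be caught before it reaches an engineer.

```python
from dataclasses import dataclass
from datetime import datetime

KNOWN_SOURCES = {"aruba", "cisco", "ipfabric"}


@dataclass(frozen=True)
class Claim:
    statement: str         # the sentence shown to the engineer
    device: str            # device the claim is about
    source: str            # which system the fact came from
    raw_value: str         # the value exactly as the source reported it
    observed_at: datetime  # when the source observed it


def ungrounded(claims, known_devices):
    """Return claims that cannot be traced to a known source and device."""
    return [
        c for c in claims
        if c.source not in KNOWN_SOURCES
        or not c.raw_value
        or c.device not in known_devices
    ]
```

Any claim returned by a check like this can be dropped or flagged instead of being presented as fact.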
The Result
The 65% MTTR reduction came almost entirely from collapsing the investigation phase. Engineers received a grounded, evidence-backed hypothesis within 60 seconds of an alert firing. The time from alert to first remediation action dropped from 20-40 minutes to under 5 minutes in most cases.
The agent did not eliminate engineers from the loop. They still make all remediation decisions. What it eliminated was the time those engineers spent doing information retrieval and correlation work that a machine could do faster and more consistently.
That is where AI agents provide the most reliable value in operations contexts today: not autonomous decision-making, but dramatically accelerated expert decision-making.