Knowledge Graphs for Incident Response

A knowledge graph gives AI SRE agents a map of your entire system during incident investigation. Instead of an AI agent that knows Kubernetes commands but doesn't know your specific services, you have an agent that knows: "orders-service depends on payments-service which depends on the payments database, and orders-service is owned by the Commerce team." This context transforms generic investigation into targeted, system-aware diagnosis.

What is a Service Knowledge Graph?

A service knowledge graph lives in a graph database (OpenSRE uses Neo4j) and stores your services as nodes and their relationships as edges:

Services → what you run
DEPENDS_ON → which services call which
OWNS → which team owns which service
DEPLOYED_ON → where services run (clusters, nodes)
USES → which databases, queues, or external APIs a service depends on

Unlike a static architecture diagram that goes stale, a knowledge graph is updated continuously: new deployments, new services, and new dependencies are reflected in near real time.
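To make the node and edge model concrete, here is a minimal in-memory sketch in Python. It is not OpenSRE's API or Neo4j's storage format; the ServiceGraph class, its methods, and the node name payments-db are hypothetical, chosen only to mirror the relationship types listed above.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    source: str    # node the relationship starts from
    rel_type: str  # e.g. "DEPENDS_ON", "OWNS", "DEPLOYED_ON", "USES"
    target: str    # node the relationship points to


class ServiceGraph:
    """Tiny in-memory stand-in for the graph database (hypothetical helper)."""

    def __init__(self) -> None:
        self.labels = {}                   # node name -> label
        self.outgoing = defaultdict(list)  # node name -> outgoing edges

    def add_node(self, name: str, label: str) -> None:
        self.labels[name] = label

    def add_edge(self, source: str, rel_type: str, target: str) -> None:
        self.outgoing[source].append(Edge(source, rel_type, target))

    def neighbors(self, source: str, rel_type: str) -> list:
        """Targets reachable from `source` over one edge of type `rel_type`."""
        return [e.target for e in self.outgoing[source] if e.rel_type == rel_type]


graph = ServiceGraph()
graph.add_node("orders-service", "Service")
graph.add_node("payments-service", "Service")
graph.add_node("payments-db", "Database")  # assumed name for the payments database
graph.add_node("Commerce", "Team")

graph.add_edge("orders-service", "DEPENDS_ON", "payments-service")
graph.add_edge("payments-service", "USES", "payments-db")
graph.add_edge("Commerce", "OWNS", "orders-service")
```

With this in place, graph.neighbors("orders-service", "DEPENDS_ON") answers "what does orders-service call?", which is the primitive the later traversals build on.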

How It Changes Incident Investigation

Without a Knowledge Graph

When payments-service has errors, an AI agent without service topology knowledge checks: "Is payments-service healthy? Yes/no." It has no way to know that three other services are currently failing because they all depend on payments-service.

With a Knowledge Graph

The AI agent queries: "What services depend on payments-service, directly or transitively?" The graph returns: orders-service, checkout-service, subscription-service. The agent checks all four simultaneously. The blast radius is understood within seconds of investigation start.

Blast Radius Analysis

Blast radius analysis answers: "If this service fails, what else breaks?"

OpenSRE performs blast radius analysis by traversing the dependency graph from the failing node:

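// Bind the matched path to a variable so length(path) is defined
MATCH path = (s:Service)-[:DEPENDS_ON*1..3]->(failing:Service {name: "payments-service"})
// min() keeps the shortest hop count when a service is reachable by several paths
RETURN s.name, s.team, min(length(path)) AS hops
ORDER BY hops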

This returns every service that depends on payments-service, up to 3 hops away. Investigation subagents check each of these services proactively, rather than waiting for more alerts to fire.
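The same traversal can be sketched outside the database as a breadth-first search over reversed DEPENDS_ON edges. This is an illustration of the technique, not OpenSRE's implementation: the blast_radius function and the edge data are assumptions, including the guess that checkout-service reaches payments-service through orders-service.

```python
from collections import deque

# Hypothetical DEPENDS_ON edges: caller -> list of services it calls.
DEPENDS_ON = {
    "orders-service": ["payments-service"],
    "checkout-service": ["orders-service"],
    "subscription-service": ["payments-service"],
    "payments-service": [],
}


def blast_radius(depends_on, failing, max_hops=3):
    """Return {service: hops} for every service that reaches `failing`
    through at most `max_hops` DEPENDS_ON edges."""
    # Invert the edges so we can walk from the failing service to its callers.
    dependents = {}
    for caller, callees in depends_on.items():
        for callee in callees:
            dependents.setdefault(callee, []).append(caller)

    hops = {}
    queue = deque([(failing, 0)])
    while queue:
        service, distance = queue.popleft()
        if distance == max_hops:
            continue  # do not expand beyond the hop limit
        for caller in dependents.get(service, []):
            if caller not in hops and caller != failing:
                hops[caller] = distance + 1  # BFS reaches each node by its shortest path first
                queue.append((caller, distance + 1))
    return hops


impacted = blast_radius(DEPENDS_ON, "payments-service")
```

Because breadth-first search visits services in order of distance, the first hop count recorded for a service is its shortest one, which matches ordering the query results by hops.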

Dependency Traversal for Root Cause

When investigating a latency spike in checkout-service, the graph answers: "What does checkout-service depend on?" The answer is payments-service, inventory-service, shipping-service, and user-service. Each of these is a candidate root cause, so the knowledge graph narrows the investigation to a handful of services instead of the whole system.
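Root-cause traversal walks the edges in the opposite direction from blast radius: from the symptomatic service toward what it depends on. A minimal sketch, assuming the hypothetical edge data below (the payments-db node and the root_cause_candidates function are illustrative, not part of OpenSRE):

```python
# Hypothetical DEPENDS_ON edges for the example in the text.
DEPENDS_ON = {
    "checkout-service": ["payments-service", "inventory-service",
                         "shipping-service", "user-service"],
    "payments-service": ["payments-db"],
}


def root_cause_candidates(depends_on, service, max_hops=2):
    """List (dependency, hops) pairs reachable from `service`, nearest first."""
    seen = {service}
    frontier = [service]
    candidates = []
    for hops in range(1, max_hops + 1):
        next_frontier = []
        for current in frontier:
            for dependency in depends_on.get(current, []):
                if dependency not in seen:
                    seen.add(dependency)
                    candidates.append((dependency, hops))
                    next_frontier.append(dependency)
        frontier = next_frontier  # expand one level per iteration
    return candidates


candidates = root_cause_candidates(DEPENDS_ON, "checkout-service")
```

Listing nearest dependencies first is a design choice for this sketch: direct dependencies are usually the cheapest to check, and deeper ones follow only if those look healthy.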

Recent Change Detection

One of the most valuable uses of the knowledge graph is change detection. The graph stores deployment events, configuration changes, and infrastructure modifications with timestamps. During investigation:

"What changed near checkout-service in the last 2 hours?"

The graph returns: "payments-service was deployed at 14:32 (32 minutes before the incident)." This is often the fastest path to root cause.
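A change lookup of this kind combines two filters: graph proximity and a time window. The sketch below assumes a hypothetical list of change events and a hypothetical recent_changes_near function. The date, the search-service entry, and the event fields are made up for illustration; only the 14:32 deploy and the 2-hour window come from the example above.

```python
from datetime import datetime, timedelta

# Hypothetical data: DEPENDS_ON edges and timestamped change events.
DEPENDS_ON = {"checkout-service": ["payments-service", "inventory-service"]}
CHANGES = [
    {"service": "payments-service", "kind": "deploy",
     "at": datetime(2024, 1, 1, 14, 32)},
    {"service": "inventory-service", "kind": "config",
     "at": datetime(2024, 1, 1, 9, 5)},
    {"service": "search-service", "kind": "deploy",
     "at": datetime(2024, 1, 1, 14, 50)},
]


def recent_changes_near(service, incident_at, window=timedelta(hours=2)):
    """Changes to `service` or its direct dependencies inside the window."""
    nearby = {service, *DEPENDS_ON.get(service, [])}
    hits = [
        change for change in CHANGES
        if change["service"] in nearby
        and incident_at - window <= change["at"] <= incident_at
    ]
    # Most recent change first: it is usually the first one worth checking.
    return sorted(hits, key=lambda change: change["at"], reverse=True)


incident_at = datetime(2024, 1, 1, 15, 4)  # 32 minutes after the deploy
for change in recent_changes_near("checkout-service", incident_at):
    minutes = int((incident_at - change["at"]).total_seconds() // 60)
    print(f'{change["service"]} {change["kind"]} at {change["at"]:%H:%M} '
          f"({minutes} minutes before the incident)")
```

Here the inventory-service change falls outside the window and search-service is not a dependency of checkout-service, so only the payments-service deploy is reported.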

Building Your Knowledge Graph

Automatic Discovery

OpenSRE can populate the knowledge graph automatically from:

Kubernetes service mesh data (Istio, Linkerd): actual traffic flows become dependency edges
Distributed tracing (Jaeger, Datadog APM): trace data reveals service call patterns
API gateway logs: request routing reveals dependencies
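As one illustration of how trace data can become dependency edges, the sketch below treats any child span that runs in a different service from its parent as evidence of a call. The span fields are a simplified, hypothetical shape rather than the Jaeger or Datadog APM format, and infer_dependencies is not an OpenSRE function.

```python
from collections import Counter

# Hypothetical, simplified spans: real tracing systems carry many more fields.
SPANS = [
    {"span_id": "a", "parent_id": None, "service": "checkout-service"},
    {"span_id": "b", "parent_id": "a", "service": "payments-service"},
    {"span_id": "c", "parent_id": "b", "service": "payments-service"},
    {"span_id": "d", "parent_id": "a", "service": "inventory-service"},
]


def infer_dependencies(spans):
    """Count caller -> callee pairs where a child span runs in another service."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    calls = Counter()
    for span in spans:
        parent_service = service_of.get(span["parent_id"])
        # Skip root spans and spans that stay inside the same service.
        if parent_service and parent_service != span["service"]:
            calls[(parent_service, span["service"])] += 1
    return calls


edges = infer_dependencies(SPANS)
```

Each key in edges is a candidate DEPENDS_ON edge, and the count gives a rough measure of how often the call was observed, which could be used to ignore one-off traffic.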

Manual Registration

For services that aren't auto-discoverable, use OpenSRE's web console to register services and their dependencies manually.

Neo4j as the Foundation

OpenSRE uses Neo4j for the knowledge graph because:

Graph queries are natural: Cypher expresses graph questions such as blast radius and shortest path directly, and whole-graph analyses such as connected components are available through Neo4j's Graph Data Science library
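Performance: Neo4j follows stored relationships from node to node, so traversal cost grows with the part of the graph a query actually visits rather than with the total number of services, which keeps bounded-depth queries fast even at thousands of services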
Flexibility: Neo4j is schema-optional, so you can add new node types and relationships as your needs evolve
