OpenSRE is built on three core systems working together: a LangGraph-orchestrated agent pipeline, an episodic memory system, and a Neo4j knowledge graph. Understanding how these interact explains how OpenSRE investigates incidents.
Slack → slack-bot (Bolt/Socket Mode) → sre-agent (LangGraph)
Web UI ──────────────────────────────→     │
                                      ┌────┴────┐
                                      │    │    │
                                   Memory Skills KG
                                      │          │
                                 PostgreSQL    Neo4j

config-service ← used by web_ui, slack-bot, sre-agent
There are two entry points: Slack (via slack-bot) and the web console (via web_ui). Both stream results from sre-agent via Server-Sent Events.
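To make the streaming step concrete, here is a minimal sketch of how a client might decode a Server-Sent Events stream. Only the framing rules (`event:` and `data:` fields, a blank line ending each event) come from the SSE format itself. The event name and payloads below are hypothetical, since the schema and endpoint path used by sre-agent are not described here.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)  # multiple data lines are joined with "\n"


# Hypothetical stream: the event name and payloads are illustrative only.
sample = [
    ": keep-alive",
    "event: finding",
    "data: pod payments-7f9 is in CrashLoopBackOff",
    "",
    "data: line one",
    "data: line two",
    "",
]
events = list(parse_sse(sample))
```

A real client would feed this parser from the HTTP response body line by line instead of from a list.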
The investigation pipeline is a directed graph with these nodes:
| Node | Role |
|------|------|
| init_context | Parses the alert, loads episodic memory context |
| planner | Breaks the investigation into parallel subtasks |
| subagent_executor | Executes one investigation subtask |
| synthesizer | Combines findings from all subagents |
| writeup | Produces the final incident report |
| memory_store | Stores the episode in episodic memory |
The key architectural decision is the Send() fan-out: the planner emits multiple Send("subagent_executor", task) events that execute in parallel. Each subagent has access to 46 investigation skills and runs its subtask independently. Because the subtasks run concurrently rather than one after another, an investigation's wall-clock time is driven roughly by its slowest subtask instead of the sum of all subtasks.
Data flow:
Alert → init_context → planner → [Send() fan-out]
↓
subagent_executor × N (parallel)
↓
synthesizer → writeup → memory_store
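The fan-out/fan-in shape of this flow can be sketched without LangGraph. The following is a plain-Python illustration that uses a thread pool. It is not LangGraph's Send() API, and the node bodies, state fields, and subtask names are placeholders invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def planner(alert: dict) -> list[dict]:
    # Placeholder plan: one subtask per area to check (names are hypothetical).
    return [{"alert": alert, "focus": f} for f in ("logs", "metrics", "deploys")]


def subagent_executor(task: dict) -> dict:
    # Stand-in for a real investigation; each call is independent of the others.
    service = task["alert"]["service"]
    return {"focus": task["focus"], "finding": f"checked {task['focus']} for {service}"}


def synthesizer(findings: list[dict]) -> str:
    # Fan-in: merge the per-subtask findings into one summary.
    return "; ".join(f["finding"] for f in findings)


def investigate(alert: dict) -> str:
    tasks = planner(alert)
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        # map() runs the subtasks concurrently and returns results in task order.
        findings = list(pool.map(subagent_executor, tasks))
    return synthesizer(findings)


report = investigate({"alert_type": "high_error_rate", "service": "payments-service"})
```

In the pipeline described above, LangGraph's Send() mechanism plays the role of the thread pool: the planner decides how many subagent_executor runs to start, and the synthesizer sees all of their results.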
After every investigation, OpenSRE stores the episode in its episodic memory. The episodic memory lifecycle:
init_context queries episodic memory for similar past episodes using weighted scoring: alert_type (0.5), service (0.3), resolved status (0.2).

This is what makes OpenSRE get better over time. The first time you see a payments-service outage, it takes longer. The tenth time, it has patterns, root causes, and strategies from past episodes.
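The weighted scoring can be illustrated with a short sketch. The weights are the ones listed above. Everything else is an assumption made for the example: the field names, the use of exact equality for a "match", and the cut-off of `k` results.

```python
# Weights from the description above; the matching rules below are assumptions.
WEIGHTS = {"alert_type": 0.5, "service": 0.3, "resolved": 0.2}


def similarity(alert: dict, episode: dict) -> float:
    """Score a past episode against the incoming alert (0.0 to 1.0)."""
    score = 0.0
    if episode.get("alert_type") == alert["alert_type"]:
        score += WEIGHTS["alert_type"]
    if episode.get("service") == alert["service"]:
        score += WEIGHTS["service"]
    if episode.get("resolved"):
        score += WEIGHTS["resolved"]
    return round(score, 2)  # avoid float noise such as 0.7999999


def top_episodes(alert: dict, episodes: list[dict], k: int = 3) -> list[tuple[float, dict]]:
    """Return the k highest-scoring episodes, best first."""
    scored = [(similarity(alert, e), e) for e in episodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


alert = {"alert_type": "high_error_rate", "service": "payments-service"}
episodes = [
    {"id": 1, "alert_type": "disk_full", "service": "search", "resolved": True},
    {"id": 2, "alert_type": "high_error_rate", "service": "checkout", "resolved": False},
    {"id": 3, "alert_type": "high_error_rate", "service": "payments-service", "resolved": True},
]
ranked = top_episodes(alert, episodes, k=2)
```

Under this reading, a resolved episode that matches neither the alert type nor the service still scores 0.2, so a real implementation may want a minimum threshold.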
OpenSRE maintains a live service topology graph in Neo4j.
During investigation, agents can query the graph.
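As an illustration of what such a query could look like, the sketch below builds a Cypher statement that finds services depending on a given service. The graph schema is not described here, so the `Service` label, the `name` property, and the `DEPENDS_ON` relationship are all hypothetical.

```python
def dependents_query(service: str, max_hops: int = 2) -> tuple[str, dict]:
    """Build a Cypher query for services that depend on `service`.

    The schema (Service nodes, DEPENDS_ON edges) is hypothetical.
    """
    # Path-length bounds cannot be passed as Cypher parameters,
    # so validate the value before putting it into the query text.
    if not isinstance(max_hops, int) or not 1 <= max_hops <= 5:
        raise ValueError("max_hops must be an integer between 1 and 5")
    query = (
        "MATCH (s:Service {name: $name})"
        f"<-[:DEPENDS_ON*1..{max_hops}]-(d:Service) "
        "RETURN DISTINCT d.name AS dependent"
    )
    return query, {"name": service}


query, params = dependents_query("payments-service")
```

The returned pair could then be handed to a Neo4j driver session, for example `session.run(query, params)`. The service name travels as a parameter, so it is never spliced into the query text.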
The 46 investigation skills are loaded on-demand. When an agent needs to check Kubernetes pod status, it calls load_skill("k8s-debug") which loads the skill's context and tools, then calls run_script to execute specific checks. This progressive loading keeps the agent's context window manageable.
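The progressive-loading idea can be sketched as a small registry that only materializes a skill when it is first requested. This is an illustration under assumptions, not OpenSRE's implementation: the real signatures of load_skill and run_script are not given here, and the `Skill` class, the `pod_status` script, and the `postgres-debug` skill are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Skill:
    name: str
    context: str  # text added to the agent's context when loaded
    scripts: dict[str, Callable[[], str]] = field(default_factory=dict)


class SkillRegistry:
    def __init__(self, loaders: dict[str, Callable[[], Skill]]):
        self._loaders = loaders             # cheap: nothing is loaded yet
        self.loaded: dict[str, Skill] = {}  # skills currently in context

    def load_skill(self, name: str) -> Skill:
        # Load on first use only; later calls reuse the cached skill.
        if name not in self.loaded:
            self.loaded[name] = self._loaders[name]()
        return self.loaded[name]

    def run_script(self, skill: str, script: str) -> str:
        if skill not in self.loaded:
            raise RuntimeError(f"skill {skill!r} must be loaded first")
        return self.loaded[skill].scripts[script]()


# Hypothetical skills; only "k8s-debug" is named in the text above.
registry = SkillRegistry({
    "k8s-debug": lambda: Skill(
        "k8s-debug", "How to inspect pods", {"pod_status": lambda: "payments-7f9 Running"}
    ),
    "postgres-debug": lambda: Skill("postgres-debug", "How to inspect queries"),
})
registry.load_skill("k8s-debug")
output = registry.run_script("k8s-debug", "pod_status")
```

After this run only `k8s-debug` occupies space in `registry.loaded`. The other skill stays as an unevaluated loader, which is how on-demand loading keeps the context small.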
Skills are organized by domain.
| Service | Host Port | Description |
|---------|-----------|-------------|
| PostgreSQL | 5433 | Primary database |
| config-service | 8081 | Configuration API |
| Neo4j HTTP | 7475 | Neo4j browser |
| Neo4j Bolt | 7688 | Neo4j driver connection |
| LiteLLM | 4001 | LLM proxy |
| sre-agent | 8001 | Investigation agent API |
| web-ui | 3002 | Admin console |