Episodic Memory System

OpenSRE's episodic memory records past investigations and uses them to guide future ones. After every investigation, OpenSRE extracts structured metadata from the outcome and stores it. When a similar incident occurs, the stored episode is retrieved and injected into the new investigation's context, much as a senior SRE recalls what worked last time.

Why Episodic Memory Matters

Without episodic memory, every incident investigation starts from scratch. An AI agent has no knowledge of past outages, no patterns to recognize, no proven approaches to try first. The first investigation of a payments-service timeout takes as long as the tenth.

With episodic memory, OpenSRE builds institutional knowledge:

  • Recognizes patterns: "This alert type usually indicates connection pool exhaustion"
  • Recalls root causes: "Last time this happened, it was a bad deployment at 14:32"
  • Applies strategies: "For this class of incident, start by checking Kubernetes pod restarts, then check Datadog APM"

The Episodic Memory Lifecycle

1. Investigation Completes

The writeup node produces a structured incident report.

2. LLM Metadata Extraction

An LLM extracts structured metadata from the investigation:

  • Summary: 2-3 sentence description of what happened
  • Root cause: The identified root cause
  • Alert type: Category of alert (e.g., high_error_rate, pod_crashloop, latency_spike)
  • Affected services: List of services involved
  • Severity: critical / high / medium / low
  • Resolution status: resolved / unresolved / partial
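
The fields above map naturally onto a small typed record. The following is a minimal sketch, not OpenSRE's actual schema: the names `EpisodeMetadata`, `Severity`, `ResolutionStatus`, and `parse_episode_metadata` are hypothetical, and it assumes the LLM output has already been parsed into a dict whose keys mirror the field list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class ResolutionStatus(str, Enum):
    RESOLVED = "resolved"
    UNRESOLVED = "unresolved"
    PARTIAL = "partial"


@dataclass
class EpisodeMetadata:
    """Hypothetical container for the six extracted fields."""
    summary: str
    root_cause: str
    alert_type: str
    affected_services: List[str]
    severity: Severity
    resolution_status: ResolutionStatus


def parse_episode_metadata(raw: dict) -> EpisodeMetadata:
    """Validate a dict (assumed to be parsed LLM output) into EpisodeMetadata.

    Raises KeyError for a missing required field and ValueError for an
    unknown severity or resolution status.
    """
    return EpisodeMetadata(
        summary=str(raw["summary"]).strip(),
        root_cause=str(raw["root_cause"]).strip(),
        alert_type=str(raw["alert_type"]).strip().lower(),
        affected_services=[str(s) for s in raw.get("affected_services", [])],
        severity=Severity(str(raw["severity"]).lower()),
        resolution_status=ResolutionStatus(str(raw["resolution_status"]).lower()),
    )
```

Validating the enumerated fields at parse time means a malformed extraction fails loudly before it is stored, rather than surfacing later as an episode that no filter can match.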

3. Episode Storage

The episode is stored in PostgreSQL via the config-service API. All investigations are stored, not just resolved ones, because unresolved incidents are valuable for learning too.

4. Similarity Search

Before a new investigation begins, init_context queries episodic memory for similar past episodes using weighted scoring:

| Factor | Weight |
|--------|--------|
| Alert type match | 0.5 |
| Service overlap | 0.3 |
| Resolution status | 0.2 |
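
The weighted score can be sketched as follows. Only the three weights come from the table. How each factor is turned into a number is not specified here, so this sketch assumes an exact match for alert type, Jaccard overlap for services, and a resolved/partial/unresolved mapping of 1.0/0.5/0.0. The names `Episode`, `score_episode`, and `top_matches` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Weights taken from the scoring table.
WEIGHTS = {"alert_type": 0.5, "service_overlap": 0.3, "resolution_status": 0.2}

# ASSUMPTION: the mapping from resolution status to a score is illustrative.
RESOLUTION_SCORE = {"resolved": 1.0, "partial": 0.5, "unresolved": 0.0}


@dataclass
class Episode:
    alert_type: str
    affected_services: List[str]
    resolution_status: str


def service_overlap(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard overlap of two service lists (ASSUMPTION: real metric may differ)."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def score_episode(alert_type: str, services: Iterable[str], episode: Episode) -> float:
    """Weighted sum of the three factors, in the range 0.0 to 1.0."""
    alert_match = 1.0 if episode.alert_type == alert_type else 0.0
    overlap = service_overlap(services, episode.affected_services)
    resolution = RESOLUTION_SCORE.get(episode.resolution_status, 0.0)
    return (
        WEIGHTS["alert_type"] * alert_match
        + WEIGHTS["service_overlap"] * overlap
        + WEIGHTS["resolution_status"] * resolution
    )


def top_matches(
    alert_type: str, services: Iterable[str], episodes: Iterable[Episode], k: int = 3
) -> List[Tuple[float, Episode]]:
    """Return the k highest-scoring (score, episode) pairs, best first."""
    services = list(services)
    scored = [(score_episode(alert_type, services, ep), ep) for ep in episodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Under these assumptions, a resolved episode with the same alert type and the same single service scores 0.5 + 0.3 + 0.2 = 1.0, while an unresolved episode with a different alert type that shares one of its two services scores 0.3 × 0.5 = 0.15.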

5. Context Injection

The top matching episodes are formatted and injected into the planner's context. The planner can see: "Last time this alert fired on payments-service, it was a database connection pool issue resolved by restarting the connection pool manager."
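
A minimal sketch of the formatting step is shown below. The output layout, the function name `format_episodes_for_planner`, and the dict keys are assumptions chosen to mirror the extracted metadata fields. The actual prompt format used by OpenSRE may differ.

```python
from typing import Dict, List


def format_episodes_for_planner(episodes: List[Dict], max_episodes: int = 3) -> str:
    """Render the top episodes as a plain-text block for the planner prompt.

    `episodes` is assumed to be sorted best match first. Returns an empty
    string when there is nothing to inject.
    """
    if not episodes:
        return ""
    lines = ["Similar past incidents:"]
    for i, ep in enumerate(episodes[:max_episodes], start=1):
        services = ", ".join(ep.get("affected_services", [])) or "unknown"
        lines.append(
            f"{i}. [{ep.get('alert_type', 'unknown')}] on {services} "
            f"({ep.get('resolution_status', 'unknown')})"
        )
        lines.append(f"   Summary: {ep.get('summary', '')}")
        lines.append(f"   Root cause: {ep.get('root_cause', '')}")
    return "\n".join(lines)
```

Capping the number of injected episodes keeps the planner's context focused on the strongest matches instead of diluting it with marginal ones.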

6. Strategy Generation

When 2 or more episodes share the same alert type, OpenSRE automatically generates a reusable investigation strategy. This strategy captures the common investigation path: which skills to run first, what patterns to look for, which services to prioritize.
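
The trigger condition can be sketched as a grouping step. This covers only the "2 or more episodes per alert type" threshold plus a simple aggregation of services and root causes. It does not model which skills to run, and how OpenSRE actually composes the strategy text is not described here. All names (`strategy_candidates`, `MIN_EPISODES_FOR_STRATEGY`, the dict keys) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List

MIN_EPISODES_FOR_STRATEGY = 2  # threshold stated in the text


def group_by_alert_type(episodes: List[Dict]) -> Dict[str, List[Dict]]:
    """Bucket episodes by their alert_type field."""
    groups = defaultdict(list)
    for ep in episodes:
        groups[ep["alert_type"]].append(ep)
    return dict(groups)


def strategy_candidates(episodes: List[Dict]) -> Dict[str, Dict]:
    """Return raw material for a strategy, per alert type that meets the threshold."""
    candidates = {}
    for alert_type, group in group_by_alert_type(episodes).items():
        if len(group) < MIN_EPISODES_FOR_STRATEGY:
            continue  # not enough history for this alert type yet
        service_counts = Counter(
            s for ep in group for s in ep.get("affected_services", [])
        )
        candidates[alert_type] = {
            "episode_count": len(group),
            # Services seen most often across the group come first.
            "priority_services": [s for s, _ in service_counts.most_common()],
            "known_root_causes": sorted(
                {ep["root_cause"] for ep in group if ep.get("root_cause")}
            ),
        }
    return candidates
```

An alert type with a single episode produces no candidate, which matches the stated rule that a strategy needs at least two episodes of the same type.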

What Gets Better Over Time

| After N investigations | What improves |
|------------------------|---------------|
| 1 | Baseline performance |
| 2-3 | Similar incidents get context from past episodes |
| 5+ | Strategies auto-generate for common alert types |
| 10+ | High accuracy pattern recognition for recurring issues |

Viewing Episodic Memory

The web console at http://localhost:3002 includes an episodic memory browser:

  • Episode list with severity, services, and resolution status filters
  • Full investigation history per episode
  • Strategy viewer showing auto-generated strategies per alert type
  • Dashboard stats: total episodes, resolution rate, average investigation depth