Architecture · 11 min read · March 2026
AI Agent Framework Specialists

AI Agent Memory Patterns: Short-Term, Long-Term, Episodic, and Semantic

A technical guide to the four memory types available to AI agents — in-context, external key-value, vector episodic, and knowledge graph semantic — with implementation patterns and trade-off analysis.

Why Memory Is the Defining Constraint of Agent Capability

The difference between a chatbot and an agent is not tool access — it's memory. A chatbot without memory resets with every session, treating each interaction as a standalone transaction. An agent with well-designed memory accumulates context across sessions, learns from past actions, and develops a working model of the environment it operates in. Memory is what allows an agent to notice that a particular API endpoint has been flaky for three days, that a user prefers formal communication, or that a compliance process requires a specific approval step before proceeding. The challenge is that language models are stateless by design: they have no built-in persistence beyond the context window. Every form of agent memory is an engineering layer built on top of this stateless foundation. There are four distinct memory types, each with different latency, capacity, durability, and retrieval characteristics. They're not mutually exclusive — production agents typically combine two or three. Understanding the trade-offs of each is a prerequisite to designing a memory architecture that matches your agent's actual requirements. The Framework Radar on AgentList maps which frameworks natively support which memory types, which is useful when your memory requirements should inform your framework selection rather than the other way around.

In-Context Memory: The Simplest and Most Limiting

In-context memory is the contents of the LLM's active context window. It's the most natural form — the entire conversation history, relevant retrieved documents, tool results, and system instructions are all present simultaneously for the model to reason over. The advantages are obvious: zero latency (no external lookup required), no infrastructure, and full coherence (the model can attend to any part of the context freely). The limitations are equally obvious: capacity is bounded by the context window (8k to 200k tokens depending on the model), it's ephemeral (wiped when the process ends), and cost scales linearly with window size — at GPT-4o pricing, a 100k-token context costs roughly $0.15 per call, which compounds fast at scale. In-context memory is appropriate for single-session agents with bounded information needs. It becomes a bottleneck when: the agent runs across multiple sessions (the history must be externalized and reloaded, but reloading full history eats the context window quickly), the agent handles many users simultaneously (full per-user history in context becomes prohibitively expensive), or the agent's operational history grows beyond what fits in even large context windows. The practical pattern for managing in-context memory in production: implement a summary node that compresses older conversation history into a compact summary when message count exceeds a threshold. The detailed messages from the last N turns are kept verbatim; everything older is represented by the compressed summary. LangGraph's state model makes this explicit — the messages field reducer can be replaced with a custom reducer that applies compression logic.
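The summary-node pattern above can be sketched framework-agnostically. This is a minimal illustration, not LangGraph's actual reducer API: the `summarize` parameter is a placeholder for an LLM summarization call, stubbed out here so the sketch runs standalone.

```python
def compress_history(messages, keep_last=6, max_before_compress=10,
                     summarize=lambda msgs: f"Summary of {len(msgs)} earlier messages."):
    """Compress older conversation history into a single summary entry.

    Keeps the last `keep_last` messages verbatim; everything older is
    replaced by one system message produced by `summarize` (a stand-in
    for an LLM call in this sketch).
    """
    if len(messages) <= max_before_compress:
        return messages  # under threshold: no compression needed
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": summarize(older)}
    return [summary] + recent
```

In a LangGraph deployment the same logic would live inside a custom reducer on the messages field, so compression happens automatically whenever state is updated.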

External Key-Value Memory: Fast, Structured Working State

External key-value memory stores structured agent state outside the context window in a database (Redis, DynamoDB, PostgreSQL). Rather than keeping all state in the context, the agent reads specific fields it needs, operates on them, and writes updates back. This pattern is most powerful when agent state is well-structured and query patterns are predictable. A user preference store is a canonical example: the agent needs `user_prefs['communication_style']` and `user_prefs['preferred_language']` at the start of every response, but doesn't need the full preference object in context — a two-key lookup is sufficient. The implementation pattern is a thin persistence layer: `memory.get(user_id, key)` returns a stored value, `memory.set(user_id, key, value)` writes it. The agent's system prompt is dynamically assembled by loading relevant K/V fields and injecting them as formatted text before the conversation messages. Redis with a 30-day TTL is the standard infrastructure choice for this layer — sub-millisecond reads, horizontal scaling, and automatic expiry for inactive users. Key design discipline: keep the schema flat and explicit. Deeply nested K/V stores become hard to reason about and hard to migrate. Define an explicit schema (even just a TypedDict) for what fields exist and what their types are. The risk of K/V memory is staleness: if the agent writes incorrect values (from a hallucinated tool result or a misclassified entity), those incorrect values persist and pollute future sessions. Implement a confidence threshold — only persist values extracted with high confidence — and provide a clear mechanism for users to correct stored preferences.
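A minimal sketch of this persistence layer, with the flat schema and confidence threshold discussed above. A dict stands in for Redis here; the class names and the 0.8 threshold are illustrative choices, not a prescribed API.

```python
from typing import TypedDict

class UserPrefs(TypedDict, total=False):
    """Explicit, flat schema for what fields may exist in the store."""
    communication_style: str
    preferred_language: str

class MemoryStore:
    """Thin K/V persistence layer. A dict stands in for Redis;
    production would use a Redis client with a 30-day TTL."""
    CONFIDENCE_THRESHOLD = 0.8  # only persist high-confidence extractions

    def __init__(self):
        self._data = {}

    def get(self, user_id, key, default=None):
        return self._data.get((user_id, key), default)

    def set(self, user_id, key, value, confidence=1.0):
        if confidence < self.CONFIDENCE_THRESHOLD:
            return False  # skip low-confidence writes to avoid polluting state
        self._data[(user_id, key)] = value
        return True

def build_system_prompt(memory, user_id):
    """Assemble the prompt prefix from a two-key lookup."""
    style = memory.get(user_id, "communication_style", "neutral")
    lang = memory.get(user_id, "preferred_language", "en")
    return f"Respond in a {style} tone, in language '{lang}'."
```

The `confidence` gate is the staleness defense: a value extracted from a hallucinated tool result with low confidence never reaches the store, so it can't pollute future sessions.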

Vector Episodic Memory: Semantic Recall Across Interactions

Vector episodic memory stores past experiences — conversation episodes, task outcomes, tool call sequences — as embeddings in a vector database, enabling semantic retrieval of relevant past events. Rather than exact lookups by key, episodic memory retrieval asks: what past experiences are most semantically similar to the current situation? A customer service agent with episodic memory can retrieve past interactions where this user reported the same issue, or successful resolution paths for this type of complaint — and use that context to guide its current response. The ingestion pattern: at the end of each conversation or task, a summarizer node distills the interaction into a compact episode record containing the outcome, the steps taken, entities involved, and a resolution status. This record is embedded and stored in a vector DB with rich metadata (user_id, timestamp, intent, resolution_success) for filtered retrieval. The retrieval pattern: at the start of a new interaction, embed the current query context and retrieve the top-3 to top-5 most similar past episodes, filtered by user_id and recency. These are injected into the prompt as relevant past context. The critical architecture decision is the granularity of episodes. Storing full conversation transcripts as single episodes loses retrieval precision. Storing individual message exchanges is too granular. The right granularity is a task-level summary: one episode per completed task or resolved query, capturing the essence of what happened without verbatim transcript. Mem0, the open-source memory layer, implements this pattern with automatic episode extraction and a clean Python API that wraps multiple vector DB backends.
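The ingestion and retrieval pattern can be sketched without a real vector DB. The bag-of-words `embed` below is a toy stand-in for an embedding model, and the class is illustrative — Mem0 or a vector DB client would replace it in production.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class EpisodicMemory:
    def __init__(self):
        self.episodes = []  # task-level summaries plus metadata

    def add_episode(self, summary, user_id, resolution_success):
        """Ingestion: one compact record per completed task, with metadata."""
        self.episodes.append({
            "summary": summary,
            "user_id": user_id,
            "resolution_success": resolution_success,
            "embedding": embed(summary),
        })

    def retrieve(self, query, user_id, k=3):
        """Retrieval: top-k most similar past episodes, filtered by user_id."""
        q = embed(query)
        candidates = [e for e in self.episodes if e["user_id"] == user_id]
        ranked = sorted(candidates, key=lambda e: cosine(q, e["embedding"]),
                        reverse=True)
        return ranked[:k]
```

Note that each stored record is the task-level summary, not the transcript — the granularity decision discussed above is enforced at ingestion time.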

Knowledge Graph Semantic Memory: Structured World Models

Knowledge graph memory represents the agent's world model as a structured graph of entities and relationships, enabling complex relational reasoning that vector search cannot support. Where episodic memory answers what happened before in situations like this, semantic memory answers what do I know about the structure of this domain? For example: a procurement agent's semantic memory might store that Vendor A is a subsidiary of Company B, that Company B has an active compliance hold, and that therefore any order from Vendor A requires additional approval — even though no past interaction with Vendor A has triggered this rule yet. This inferential capability requires a graph, not a vector store. Implementation options range from managed graph databases (Neo4j, Neptune, TigerGraph) for production deployments to lightweight in-process options (NetworkX for small graphs, SQLite with adjacency tables for medium-scale). The key design challenge is graph maintenance: who creates and updates the relationships? Options are manual curation (high quality, high cost), LLM-assisted extraction from documents and interactions (scalable, requires careful validation), or structured data import from existing enterprise systems (CRM, ERP). For most agent deployments, a hybrid approach works well: import known entity relationships from enterprise systems at build time, use LLM extraction to add new entities discovered during operation, and flag low-confidence extractions for human review. Knowledge graph memory is significantly more complex to operate than K/V or vector memory, so reserve it for agents where relational reasoning is a genuine requirement, not a theoretical capability.
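The procurement example can be sketched with a lightweight in-process graph of the kind described above (adjacency lists rather than Neo4j). The entities, relation names, and approval rule below are illustrative, taken from the example in this section.

```python
class KnowledgeGraph:
    """Minimal in-process graph: adjacency lists keyed by relation."""

    def __init__(self):
        self.edges = {}       # entity -> list of (relation, target)
        self.attributes = {}  # entity -> dict of flags

    def add_edge(self, source, relation, target):
        self.edges.setdefault(source, []).append((relation, target))

    def set_attribute(self, entity, key, value):
        self.attributes.setdefault(entity, {})[key] = value

    def ancestors(self, entity, relation):
        """Follow a relation transitively (e.g. subsidiary_of chains)."""
        seen, stack = set(), [entity]
        while stack:
            node = stack.pop()
            for rel, target in self.edges.get(node, []):
                if rel == relation and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

def requires_extra_approval(graph, vendor):
    # An order needs approval if the vendor, or any parent company
    # reached via subsidiary_of, carries an active compliance hold.
    entities = {vendor} | graph.ancestors(vendor, "subsidiary_of")
    return any(graph.attributes.get(e, {}).get("compliance_hold")
               for e in entities)
```

The point of the traversal is exactly the inference vector search can't do: "Vendor A needs approval" is never stored anywhere — it's derived from the subsidiary edge plus Company B's compliance-hold flag.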

Combining Memory Types: A Layered Access Protocol

Combining memory types requires an explicit memory access protocol in your agent design. A practical pattern for a multi-session enterprise agent uses three layers with defined read/write rules. At session start, the agent: reads from K/V memory to load user preferences and current session context (under 5ms), runs a vector episodic query to retrieve relevant past interactions (50–150ms), and assembles these into the initial system prompt context block. During task execution, the agent: reads and writes K/V memory for ephemeral working state (entities extracted mid-task, intermediate results), and reads the knowledge graph for relationship lookups when a tool result surfaces a new entity that needs context (100–300ms for graph traversal). At session end, the agent: writes a task episode to vector memory (async, off the critical path), updates K/V memory with any preference signals collected during the session, and if new entities were discovered, queues them for knowledge graph extraction review. The total memory overhead per session start is typically 150–300ms — acceptable for most use cases. If latency is critical, pre-warm episodic memory by triggering the vector query on session creation (when the user logs in) rather than on the first message, so results are ready before the first query arrives.
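The session lifecycle above can be sketched as a small orchestration layer. The stub backends and method names here are assumptions for illustration — real deployments would put Redis, a vector DB, and a graph store behind these interfaces, and make the session-end episode write asynchronous.

```python
class KV:
    """Stub K/V backend (Redis in production)."""
    def __init__(self): self._d = {}
    def get(self, uid, default=None): return self._d.get(uid, default)
    def set(self, uid, value): self._d[uid] = value

class Episodic:
    """Stub episodic backend (vector DB in production)."""
    def __init__(self): self._eps = []
    def store(self, uid, summary): self._eps.append((uid, summary))
    def retrieve(self, uid, query, k=3):
        return [s for u, s in self._eps if u == uid][-k:]

class Session:
    def __init__(self, kv, episodic, user_id):
        self.kv, self.episodic, self.user_id = kv, episodic, user_id
        self.working_state = {}  # ephemeral mid-task state

    def start(self, first_query):
        """Session start: K/V read (<5ms) + episodic query (50-150ms)."""
        prefs = self.kv.get(self.user_id, {})
        past = self.episodic.retrieve(self.user_id, first_query, k=3)
        return {"preferences": prefs, "relevant_episodes": past}

    def end(self, episode_summary, pref_updates):
        """Session end: episode write (async in production) + pref merge."""
        self.episodic.store(self.user_id, episode_summary)
        merged = {**self.kv.get(self.user_id, {}), **pref_updates}
        self.kv.set(self.user_id, merged)
```

Pre-warming maps cleanly onto this structure: call `start` when the session is created rather than on the first message, and cache the returned context block.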

Multi-Agent Memory Sharing: Architecture and Pitfalls

When multiple agents share a memory layer, the complexity multiplies. The fundamental tension is between isolation (each agent has a coherent, consistent view of memory) and sharing (agents can learn from each other's experiences). Write conflicts are the primary failure mode: two agents simultaneously writing to the same K/V key with different values produces a last-write-wins race condition. For episodic and knowledge graph memory, concurrent writes can produce inconsistent graph states. The solutions map to familiar distributed systems patterns. For K/V memory: use optimistic concurrency control — each read returns a version number, each write includes the version number, and writes are rejected if the version has changed. For episodic memory: treat the vector store as append-only with tombstoning for corrections rather than in-place updates. This eliminates write conflicts at the cost of requiring deduplication at read time. For knowledge graph memory: use a transactional graph database (Neo4j 5.x supports ACID transactions) and require that all graph mutations go through a single graph maintenance agent rather than allowing arbitrary graph writes from all agents. The multi-agent memory sharing architecture should also define a clear namespace scheme: all memory keys should be prefixed with an agent identifier so that agent-specific memories don't collide with shared memories. CrewAI's built-in memory system implements per-agent memory isolation with an explicit shared memory namespace — reviewing its design is useful even if you're not using CrewAI directly.
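The optimistic concurrency scheme for K/V memory can be sketched as a versioned store. This is an illustrative in-memory implementation; Redis or DynamoDB would provide the same compare-and-set semantics natively.

```python
class VersionedKV:
    """Optimistic concurrency control: reads return (value, version);
    writes must present the version they read, and stale writes fail."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self._data.get(key, (None, 0))
        if expected_version != current:
            return False  # another agent wrote first; caller must re-read
        self._data[key] = (value, current + 1)
        return True
```

With two agents racing on the same key, whichever writes first wins; the second write is rejected instead of silently overwriting, and that agent re-reads and retries with fresh state. Keys would additionally carry the namespace prefix described above (for example `agent_a:user:42:prefs` versus `shared:user:42:prefs`).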
