Memory for AI Agents: Short-term, Long-term, and RAG
Every conversation with ChatGPT starts fresh. It doesn't remember you, your preferences, or your previous conversations. For a chatbot, that's fine. For an agent that's supposed to work with you over time? It's a fatal flaw.
Memory transforms agents from stateless tools into intelligent assistants that learn, adapt, and improve.
This guide shows you how to implement memory in AI agents—from simple conversation buffers to sophisticated retrieval systems that give agents access to vast knowledge bases.
Why Agents Need Memory
Without memory, agents:
- Forget context mid-conversation
- Can't learn from past mistakes
- Have no access to private knowledge
- Repeat the same errors endlessly
- Can't personalize to users
With memory, agents:
- Maintain context across sessions
- Learn from experience
- Access company knowledge bases
- Improve over time
- Personalize responses
The Three Types of Agent Memory
```
┌─────────────────────────────────────────────────────────────┐
│                        AGENT MEMORY                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │   SHORT-TERM    │  │    LONG-TERM    │  │  EXTERNAL   │  │
│  │     MEMORY      │  │     MEMORY      │  │  KNOWLEDGE  │  │
│  ├─────────────────┤  ├─────────────────┤  ├─────────────┤  │
│  │                 │  │                 │  │             │  │
│  │ • Context window│  │ • Past sessions │  │ • Documents │  │
│  │ • Current chat  │  │ • User prefs    │  │ • Databases │  │
│  │ • Working state │  │ • Learned facts │  │ • APIs      │  │
│  │                 │  │ • Experiences   │  │ • Web       │  │
│  │                 │  │                 │  │             │  │
│  │    Volatile     │  │   Persistent    │  │  Retrieved  │  │
│  │  ~128K tokens   │  │    Unlimited    │  │  On-demand  │  │
│  │                 │  │                 │  │             │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
1. Short-Term Memory (Context Window)
The conversation history within a single session. Limited by the model's context window (4K to 128K+ tokens).
2. Long-Term Memory (Persistent)
Information that persists across sessions—user preferences, past interactions, learned facts. Stored externally and retrieved when needed.
3. External Knowledge (RAG)
Access to documents, databases, and knowledge bases that weren't in the model's training data. Retrieved dynamically based on the current query.
Short-Term Memory: Managing Context
Basic Conversation Buffer
The simplest memory—just keep the full conversation:
```python
class ConversationBuffer:
    def __init__(self, max_tokens: int = 8000):
        self.messages = []
        self.max_tokens = max_tokens

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim_if_needed()

    def _trim_if_needed(self):
        """Remove oldest messages if we exceed the token limit"""
        while self._estimate_tokens() > self.max_tokens and len(self.messages) > 1:
            # Keep the system message; drop the oldest user/assistant message
            if self.messages[0]["role"] == "system":
                self.messages.pop(1)
            else:
                self.messages.pop(0)

    def _estimate_tokens(self) -> int:
        # Rough estimate: ~4 characters per token
        return sum(len(m["content"]) // 4 for m in self.messages)

    def get_messages(self) -> list:
        return self.messages.copy()


# Usage
memory = ConversationBuffer()
memory.add("system", "You are a helpful assistant.")
memory.add("user", "What's the capital of France?")
memory.add("assistant", "The capital of France is Paris.")
memory.add("user", "What's its population?")  # Agent remembers we're talking about Paris
```
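The 4-characters-per-token estimate is intentionally rough. If you need exact budgeting, a tokenizer such as tiktoken (an extra dependency, assumed installed here) can back the estimate instead. A minimal sketch:

```python
import tiktoken

def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Count tokens with the model's actual tokenizer instead of estimating."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # fallback for unknown models
    return sum(len(encoding.encode(m["content"])) for m in messages)
```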
Sliding Window with Summary
For longer conversations, summarize old messages instead of discarding them:
```python
import openai

class SummarizingMemory:
    def __init__(self, window_size: int = 10, max_tokens: int = 4000):
        self.client = openai.OpenAI()
        self.messages = []
        self.summary = ""
        self.window_size = window_size
        self.max_tokens = max_tokens

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

        # Summarize when the window is exceeded
        if len(self.messages) > self.window_size * 2:
            self._summarize_old_messages()

    def _summarize_old_messages(self):
        """Compress old messages into the running summary"""
        # Take the oldest half of the messages
        to_summarize = self.messages[:self.window_size]
        self.messages = self.messages[self.window_size:]

        # Generate summary
        summary_prompt = f"""Summarize this conversation, preserving key facts and decisions:

Previous summary: {self.summary}

New messages:
{self._format_messages(to_summarize)}

Provide a concise summary."""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",  # Use a cheaper model for summarization
            messages=[{"role": "user", "content": summary_prompt}]
        )

        self.summary = response.choices[0].message.content

    def get_messages(self) -> list:
        """Get messages with the summary as context"""
        result = []

        if self.summary:
            result.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{self.summary}"
            })

        result.extend(self.messages)
        return result

    def _format_messages(self, messages: list) -> str:
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```
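Usage mirrors the basic buffer; the difference only shows once the window fills. A short sketch (the turn content is illustrative):

```python
memory = SummarizingMemory(window_size=10)
memory.add("user", "Let's plan the database migration.")
memory.add("assistant", "Sure. Which tables are involved?")
# ... after more than 20 turns, the oldest 10 are folded into the summary ...
messages = memory.get_messages()  # summary message (if any) + recent turns
```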
Working Memory for Multi-Step Tasks
For agents executing multi-step tasks, maintain structured working memory:
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class WorkingMemory:
    """Structured memory for task execution"""
    goal: str = ""
    current_step: int = 0
    plan: list[str] = field(default_factory=list)
    completed_steps: list[dict] = field(default_factory=list)
    variables: dict[str, Any] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)

    def to_context(self) -> str:
        """Convert to a context string for the LLM"""
        return f"""Current Task State:
Goal: {self.goal}
Progress: Step {self.current_step + 1} of {len(self.plan)}

Plan:
{self._format_plan()}

Variables:
{self._format_variables()}

Recent Errors: {self.errors[-3:] if self.errors else 'None'}
"""

    def _format_plan(self) -> str:
        lines = []
        for i, step in enumerate(self.plan):
            status = "✓" if i < self.current_step else "→" if i == self.current_step else " "
            lines.append(f"  [{status}] {i+1}. {step}")
        return "\n".join(lines)

    def _format_variables(self) -> str:
        if not self.variables:
            return "  (none)"
        return "\n".join(f"  {k}: {v}" for k, v in self.variables.items())


# Usage in an agent (planning and step execution are implemented elsewhere in the agent)
class TaskAgent:
    def __init__(self):
        self.working_memory = WorkingMemory()

    def execute(self, goal: str):
        self.working_memory.goal = goal
        self.working_memory.plan = self._create_plan(goal)

        for i, step in enumerate(self.working_memory.plan):
            self.working_memory.current_step = i

            # Include working memory in the context
            context = self.working_memory.to_context()
            result = self._execute_step(step, context)

            self.working_memory.completed_steps.append({
                "step": step,
                "result": result
            })

            # Store results as variables for later steps
            if "output" in result:
                self.working_memory.variables[f"step_{i}_output"] = result["output"]
```
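The TaskAgent above leaves _create_plan and _execute_step to the rest of the agent. As a rough illustration only, planning can be a single LLM call that returns one step per line; a sketch of a function that could back _create_plan:

```python
import openai

def create_plan(goal: str) -> list[str]:
    """Sketch of a planner: ask the LLM for short steps, one per line."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Break this goal into 3-7 short, concrete steps, one per line:\n{goal}"
        }]
    )
    lines = response.choices[0].message.content.splitlines()
    # Strip leading numbering and bullets, drop blank lines
    return [line.strip().lstrip("0123456789.-) ").strip() for line in lines if line.strip()]
```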
Long-Term Memory: Persistence Across Sessions
Vector Database for Semantic Search
The most common approach—store memories as embeddings and retrieve by semantic similarity:
```python
import openai
import numpy as np
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Memory:
    content: str
    embedding: list[float]
    metadata: dict
    timestamp: datetime

class VectorMemory:
    def __init__(self):
        self.client = openai.OpenAI()
        self.memories: list[Memory] = []

    def add(self, content: str, metadata: dict = None):
        """Store a memory with its embedding"""
        embedding = self._get_embedding(content)

        memory = Memory(
            content=content,
            embedding=embedding,
            metadata=metadata or {},
            timestamp=datetime.now()
        )

        self.memories.append(memory)

    def search(self, query: str, top_k: int = 5) -> list[Memory]:
        """Find the memories most relevant to the query"""
        query_embedding = self._get_embedding(query)

        # Calculate similarities
        similarities = []
        for memory in self.memories:
            sim = self._cosine_similarity(query_embedding, memory.embedding)
            similarities.append((memory, sim))

        # Sort by similarity and return top_k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [m for m, _ in similarities[:top_k]]

    def _get_embedding(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: list, b: list) -> float:
        a = np.array(a)
        b = np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Usage
memory = VectorMemory()

# Store memories
memory.add("User prefers Python over JavaScript", {"type": "preference"})
memory.add("User's project is an e-commerce platform", {"type": "context"})
memory.add("User had trouble with authentication last week", {"type": "issue"})

# Retrieve relevant memories
relevant = memory.search("What programming language should I use?")
# Top result: "User prefers Python over JavaScript"
```
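As written, VectorMemory lives only in process memory, so it vanishes on restart. A minimal way to make it survive sessions, sketched here with JSON on disk (the file path is an assumption) and reusing the Memory and VectorMemory classes above:

```python
import json
from datetime import datetime

def save_memories(memory: VectorMemory, path: str = "memories.json") -> None:
    """Dump stored memories (content, embedding, metadata, timestamp) to disk."""
    with open(path, "w") as f:
        json.dump([
            {
                "content": m.content,
                "embedding": m.embedding,
                "metadata": m.metadata,
                "timestamp": m.timestamp.isoformat(),
            }
            for m in memory.memories
        ], f)

def load_memories(memory: VectorMemory, path: str = "memories.json") -> None:
    """Reload previously saved memories without re-embedding them."""
    with open(path) as f:
        for item in json.load(f):
            memory.memories.append(Memory(
                content=item["content"],
                embedding=item["embedding"],
                metadata=item["metadata"],
                timestamp=datetime.fromisoformat(item["timestamp"]),
            ))
```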
Production Vector Store with Pinecone/Weaviate
For production, use a managed vector database:
```python
import openai
from pinecone import Pinecone

class ProductionMemory:
    def __init__(self, index_name: str):
        self.pc = Pinecone(api_key="your-api-key")
        self.index = self.pc.Index(index_name)
        self.openai = openai.OpenAI()

    def add(self, memory_id: str, content: str, metadata: dict = None):
        """Store a memory in Pinecone"""
        embedding = self._get_embedding(content)

        self.index.upsert(vectors=[{
            "id": memory_id,
            "values": embedding,
            "metadata": {
                "content": content,
                **(metadata or {})
            }
        }])

    def search(self, query: str, top_k: int = 5, filter: dict = None) -> list[dict]:
        """Search memories with optional metadata filtering"""
        query_embedding = self._get_embedding(query)

        results = self.index.query(
            vector=query_embedding,
            top_k=top_k,
            filter=filter,
            include_metadata=True
        )

        return [
            {
                "id": match.id,
                "score": match.score,
                "content": match.metadata.get("content"),
                "metadata": match.metadata
            }
            for match in results.matches
        ]

    def delete(self, memory_id: str):
        """Remove a memory"""
        self.index.delete(ids=[memory_id])

    def _get_embedding(self, text: str) -> list[float]:
        response = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding


# Usage with user-specific memories
memory = ProductionMemory("agent-memories")

# Store a user-specific memory
memory.add(
    memory_id="user_123_pref_1",
    content="User prefers detailed technical explanations",
    metadata={"user_id": "123", "type": "preference"}
)

# Search only this user's memories
results = memory.search(
    query="How should I explain this concept?",
    filter={"user_id": "123"}
)
```
Structured Long-Term Memory
For specific types of information, use structured storage:
```python
import json
from datetime import datetime
from pathlib import Path

class StructuredMemory:
    def __init__(self, storage_path: str):
        self.path = Path(storage_path)
        self.path.mkdir(parents=True, exist_ok=True)

    def get_user_profile(self, user_id: str) -> dict:
        """Get or create a user profile"""
        profile_path = self.path / f"user_{user_id}.json"

        if profile_path.exists():
            return json.loads(profile_path.read_text())

        return {
            "user_id": user_id,
            "created_at": datetime.now().isoformat(),
            "preferences": {},
            "facts": [],
            "interaction_count": 0
        }

    def update_user_profile(self, user_id: str, updates: dict):
        """Update a user profile"""
        profile = self.get_user_profile(user_id)
        profile.update(updates)
        profile["updated_at"] = datetime.now().isoformat()

        profile_path = self.path / f"user_{user_id}.json"
        profile_path.write_text(json.dumps(profile, indent=2))

    def add_fact(self, user_id: str, fact: str, source: str = None):
        """Store a learned fact about the user"""
        profile = self.get_user_profile(user_id)

        profile["facts"].append({
            "fact": fact,
            "learned_at": datetime.now().isoformat(),
            "source": source
        })

        self.update_user_profile(user_id, profile)

    def add_preference(self, user_id: str, key: str, value: str):
        """Store a user preference"""
        profile = self.get_user_profile(user_id)
        profile["preferences"][key] = value
        self.update_user_profile(user_id, profile)


# Usage
memory = StructuredMemory("./agent_memory")

# Learn about the user
memory.add_fact("user_123", "Works at a fintech startup")
memory.add_preference("user_123", "communication_style", "concise")
memory.add_preference("user_123", "expertise_level", "senior developer")

# Later, personalize responses
profile = memory.get_user_profile("user_123")
# Use profile["preferences"]["communication_style"] to adjust response length
```
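The comment above hints at the last step: folding the profile into the prompt. A minimal sketch of that step (the prompt wording is illustrative, not prescribed):

```python
def build_personalized_system_prompt(profile: dict) -> str:
    """Turn a stored profile into system-prompt instructions (illustrative wording)."""
    prefs = profile.get("preferences", {})
    style = prefs.get("communication_style", "balanced")
    level = prefs.get("expertise_level", "unknown")
    facts = "; ".join(f["fact"] for f in profile.get("facts", [])) or "none recorded"
    return (
        "You are a helpful assistant.\n"
        f"Respond in a {style} style for a {level} audience.\n"
        f"Known facts about the user: {facts}."
    )

prompt = build_personalized_system_prompt(memory.get_user_profile("user_123"))
```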
RAG: Retrieval Augmented Generation
RAG gives agents access to knowledge beyond their training data:
```
┌─────────────────────────────────────────────────────────────┐
│                        RAG Pipeline                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  User Query                                                 │
│      │                                                      │
│      ▼                                                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │    Embed    │───▶│   Search    │───▶│  Retrieve   │      │
│  │    Query    │    │  Vector DB  │    │  Documents  │      │
│  └─────────────┘    └─────────────┘    └──────┬──────┘      │
│                                               │             │
│                                               ▼             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                     LLM Prompt                      │    │
│  │                                                     │    │
│  │  Context: [Retrieved documents]                     │    │
│  │  Question: [User query]                             │    │
│  │  Answer based on the context above.                 │    │
│  │                                                     │    │
│  └─────────────────────────────────────────────────────┘    │
│                             │                               │
│                             ▼                               │
│                          Response                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
Basic RAG Implementation
```python
import openai
import numpy as np
from dataclasses import dataclass

@dataclass
class Document:
    content: str
    metadata: dict
    embedding: list[float] = None

class RAGAgent:
    def __init__(self):
        self.client = openai.OpenAI()
        self.documents: list[Document] = []

    def add_documents(self, docs: list[str], metadata: list[dict] = None):
        """Index documents for retrieval"""
        for i, content in enumerate(docs):
            embedding = self._get_embedding(content)
            doc = Document(
                content=content,
                metadata=metadata[i] if metadata else {},
                embedding=embedding
            )
            self.documents.append(doc)

    def query(self, question: str, top_k: int = 3) -> str:
        """Answer a question using retrieved context"""

        # Step 1: Retrieve relevant documents
        relevant_docs = self._retrieve(question, top_k)

        # Step 2: Build context
        context = "\n\n---\n\n".join([doc.content for doc in relevant_docs])

        # Step 3: Generate answer
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Answer the question based on the provided context.
If the context doesn't contain relevant information, say so.
Cite sources when possible."""
            }, {
                "role": "user",
                "content": f"""Context:
{context}

Question: {question}"""
            }]
        )

        return response.choices[0].message.content

    def _retrieve(self, query: str, top_k: int) -> list[Document]:
        """Find the most relevant documents"""
        query_embedding = self._get_embedding(query)

        scored = []
        for doc in self.documents:
            similarity = self._cosine_similarity(query_embedding, doc.embedding)
            scored.append((doc, similarity))

        scored.sort(key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in scored[:top_k]]

    def _get_embedding(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a, b):
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Usage
agent = RAGAgent()

# Index company documentation
agent.add_documents([
    "Our API rate limit is 100 requests per minute for free tier users.",
    "Premium users get 1000 requests per minute and priority support.",
    "To upgrade, visit settings > billing > upgrade plan.",
    "API keys can be rotated in settings > security > API keys."
])

# Answer questions using the documentation
answer = agent.query("How many API requests can I make?")
print(answer)
# "Based on your tier: free users can make 100 requests/minute,
#  premium users can make 1000 requests/minute..."
```
Advanced RAG with Chunking and Re-ranking
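Real documents rarely fit in a single embedding. Split them into overlapping chunks, retrieve a generous candidate set, then let a cheaper model re-rank before generating: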
```python
import openai
import numpy as np

class AdvancedRAG:
    def __init__(self):
        self.client = openai.OpenAI()
        self.chunks = []

    def index_document(self, content: str, chunk_size: int = 500, overlap: int = 50):
        """Split a document into overlapping chunks and index them"""
        chunks = self._chunk_text(content, chunk_size, overlap)

        for i, chunk in enumerate(chunks):
            embedding = self._get_embedding(chunk)
            self.chunks.append({
                "id": f"chunk_{len(self.chunks)}",
                "content": chunk,
                "embedding": embedding,
                "position": i
            })

    def query(self, question: str, top_k: int = 5) -> str:
        # Step 1: Initial retrieval
        candidates = self._retrieve(question, top_k * 2)

        # Step 2: Re-rank with an LLM
        reranked = self._rerank(question, candidates, top_k)

        # Step 3: Generate with the best context
        context = "\n\n".join([c["content"] for c in reranked])

        return self._generate_answer(question, context)

    def _chunk_text(self, text: str, size: int, overlap: int) -> list[str]:
        """Split text into overlapping chunks"""
        words = text.split()
        chunks = []

        for i in range(0, len(words), size - overlap):
            chunk = " ".join(words[i:i + size])
            if chunk:
                chunks.append(chunk)

        return chunks

    def _retrieve(self, query: str, top_k: int) -> list[dict]:
        """Vector similarity search"""
        query_embedding = self._get_embedding(query)

        scored = []
        for chunk in self.chunks:
            sim = self._cosine_similarity(query_embedding, chunk["embedding"])
            scored.append({**chunk, "score": sim})

        scored.sort(key=lambda x: x["score"], reverse=True)
        return scored[:top_k]

    def _rerank(self, query: str, candidates: list[dict], top_k: int) -> list[dict]:
        """Use an LLM to rerank candidates"""
        # Format candidates for reranking
        candidate_text = "\n".join([
            f"[{i}] {c['content'][:200]}..."
            for i, c in enumerate(candidates)
        ])

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Rank these passages by relevance to the question.
Return only the indices of the top {top_k} most relevant, in order.

Question: {query}

Passages:
{candidate_text}

Return format: 3, 1, 5, 2, 4"""
            }]
        )

        # Parse the ranking; fall back to the original order if parsing fails
        try:
            indices = [int(x.strip()) for x in response.choices[0].message.content.split(",")]
            return [candidates[i] for i in indices[:top_k] if i < len(candidates)]
        except (ValueError, IndexError):
            return candidates[:top_k]

    def _generate_answer(self, question: str, context: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Answer based on the context. Be precise and cite relevant parts."
            }, {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }]
        )
        return response.choices[0].message.content

    def _get_embedding(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a, b) -> float:
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```
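A quick usage sketch (the file name and question are placeholders):

```python
rag = AdvancedRAG()

with open("product_handbook.txt") as f:   # placeholder document
    rag.index_document(f.read())

print(rag.query("What is the refund policy?"))  # placeholder question
```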
Combining Memory Types
A complete agent uses all three memory types:
```python
import json
import openai
from datetime import datetime

class MemoryEnabledAgent:
    def __init__(self, user_id: str):
        self.client = openai.OpenAI()
        self.user_id = user_id

        # Short-term: current conversation
        self.conversation = SummarizingMemory()

        # Long-term: user-specific memories
        self.user_memory = VectorMemory()

        # External: knowledge base
        self.knowledge_base = RAGAgent()

        # Load user profile
        self.profile = self._load_profile()

    def _load_profile(self) -> dict:
        # Placeholder: in practice, load from persistent storage (e.g., StructuredMemory)
        return {"name": "Unknown", "preferences": {}}

    def chat(self, message: str) -> str:
        # Add the user message to short-term memory
        self.conversation.add("user", message)

        # Retrieve relevant long-term memories
        relevant_memories = self.user_memory.search(message, top_k=3)
        memory_context = "\n".join([m.content for m in relevant_memories])

        # Retrieve relevant knowledge
        knowledge_context = ""
        if self._needs_knowledge(message):
            knowledge_results = self.knowledge_base._retrieve(message, top_k=3)
            knowledge_context = "\n".join([doc.content for doc in knowledge_results])

        # Build the system prompt with context
        system_prompt = self._build_system_prompt(memory_context, knowledge_context)

        # Generate a response
        messages = [{"role": "system", "content": system_prompt}]
        messages.extend(self.conversation.get_messages())

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )

        assistant_message = response.choices[0].message.content

        # Add to short-term memory
        self.conversation.add("assistant", assistant_message)

        # Extract and store any new facts about the user
        self._extract_and_store_facts(message, assistant_message)

        return assistant_message

    def _build_system_prompt(self, memories: str, knowledge: str) -> str:
        prompt = f"""You are a helpful AI assistant with memory.

User Profile:
- Name: {self.profile.get('name', 'Unknown')}
- Preferences: {self.profile.get('preferences', {})}

Relevant memories about this user:
{memories if memories else '(No relevant memories)'}

Relevant knowledge:
{knowledge if knowledge else '(No external knowledge needed)'}

Use this context to personalize your responses."""

        return prompt

    def _needs_knowledge(self, message: str) -> bool:
        """Decide whether to search the knowledge base (simple keyword heuristic)"""
        knowledge_triggers = ["how do", "what is", "explain", "help me", "documentation"]
        return any(trigger in message.lower() for trigger in knowledge_triggers)

    def _extract_and_store_facts(self, user_msg: str, assistant_msg: str):
        """Extract facts from the conversation to store in long-term memory"""
        extraction_prompt = f"""Extract any new facts about the user from this exchange.
Return JSON: {{"facts": ["fact1", "fact2"]}} or {{"facts": []}} if none.

User: {user_msg}
Assistant: {assistant_msg}"""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": extraction_prompt}],
            response_format={"type": "json_object"}
        )

        result = json.loads(response.choices[0].message.content)

        for fact in result.get("facts", []):
            self.user_memory.add(
                content=fact,
                metadata={
                    "user_id": self.user_id,
                    "extracted_at": datetime.now().isoformat()
                }
            )


# Usage
agent = MemoryEnabledAgent(user_id="user_123")

# First conversation
agent.chat("Hi! I'm a Python developer working on machine learning projects.")
agent.chat("I prefer concise explanations.")

# Later session - the agent remembers!
agent.chat("Can you help me with my code?")
# The agent responds knowing the user is a Python ML developer who prefers concise answers
```
Memory with Code Execution
For agents that execute code, persist state across executions:
```python
import json
from hopx import Sandbox

class StatefulCodeAgent:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.sandbox = None
        self.state_file = f"/app/state_{session_id}.json"

    def start_session(self):
        """Create a sandbox and restore state"""
        self.sandbox = Sandbox.create(template="code-interpreter")

        # Check for existing state
        try:
            state_content = self.sandbox.files.read(self.state_file)
            self.state = json.loads(state_content)
            print(f"Restored state with {len(self.state.get('variables', {}))} variables")
        except Exception:
            self.state = {"variables": {}, "history": []}

    def execute(self, code: str) -> str:
        """Execute code and persist state"""

        # Inject state restoration
        setup_code = f"""
import json

# Restore variables from the previous execution
_state = {json.dumps(self.state.get('variables', {}))}
globals().update(_state)
"""

        # Wrap the code to capture new variables
        wrapped_code = f"""
{setup_code}

# User code
{code}

# Capture state (only simple, JSON-serializable values)
import json
_new_state = {{k: v for k, v in globals().items()
              if not k.startswith('_') and k not in ['json', 'builtins']
              and isinstance(v, (int, float, str, list, dict, bool))}}
with open('{self.state_file}', 'w') as f:
    json.dump({{'variables': _new_state}}, f)
"""

        self.sandbox.files.write("/app/code.py", wrapped_code)
        result = self.sandbox.commands.run("python /app/code.py")

        # Update local state
        try:
            state_content = self.sandbox.files.read(self.state_file)
            self.state = json.loads(state_content)
        except Exception:
            pass

        return result.stdout if result.exit_code == 0 else f"Error: {result.stderr}"

    def get_variables(self) -> dict:
        """Get the current session's variables"""
        return self.state.get("variables", {})

    def end_session(self):
        """Clean up; the latest variables remain in self.state for the next session"""
        if self.sandbox:
            self.sandbox.kill()


# Usage
agent = StatefulCodeAgent("session_abc123")
agent.start_session()

# First execution
agent.execute("x = 10\ny = 20\nprint(x + y)")  # Output: 30

# Second execution - variables persist!
agent.execute("print(x * y)")  # Output: 200

# Check what's stored
print(agent.get_variables())  # {'x': 10, 'y': 20}

agent.end_session()
```
Best Practices
1. Separate Memory Concerns
```python
# ❌ Don't: mix all memory in one place
memory = {"conversation": [...], "user_facts": [...], "documents": [...]}

# ✅ Do: separate by type and lifecycle
class AgentMemory:
    def __init__(self):
        self.short_term = ConversationBuffer()  # Per-session
        self.long_term = VectorMemory()         # Persistent
        self.knowledge = RAGAgent()             # External
```
2. Implement Memory Decay
```python
from datetime import datetime

def search_with_decay(self, query: str, decay_days: int = 30):
    """Weight recent memories higher (assumes results carry .score and .timestamp)"""
    results = self.search(query)

    now = datetime.now()
    for result in results:
        age_days = (now - result.timestamp).days
        decay_factor = max(0.5, 1 - (age_days / decay_days))
        result.score *= decay_factor

    return sorted(results, key=lambda x: x.score, reverse=True)
```
3. Limit Memory Scope
```python
# Filter memories by relevance
def get_relevant_memories(self, query: str, context: str):
    all_memories = self.search(query)

    # Only include highly relevant memories (assumes results carry a similarity score)
    return [m for m in all_memories if m.score > 0.7]
```
4. Handle Memory Conflicts
```python
def add_with_conflict_resolution(self, fact: str):
    # Check for conflicting memories
    similar = self.search(fact, top_k=3)

    for existing in similar:
        if self._is_contradiction(fact, existing.content):
            # New information replaces old
            self.delete(existing.id)

    self.add(fact)
```
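The sketch above assumes the memory store exposes ids, a delete method, and an _is_contradiction check. One hedged way to implement the check is a small LLM yes/no call; a sketch of a method you could add to the memory class:

```python
import openai

def _is_contradiction(self, new_fact: str, existing_fact: str) -> bool:
    """Ask a small model whether two stored facts contradict each other (sketch)."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Do these two statements about the same user contradict each other? "
                "Answer only YES or NO.\n"
                f"A: {existing_fact}\nB: {new_fact}"
            )
        }]
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```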
Conclusion
Memory transforms agents from forgetful assistants into intelligent systems that:
- Maintain context within and across sessions
- Learn preferences and personalize over time
- Access knowledge beyond training data
- Build expertise through accumulated experience
Start with simple conversation memory. Add long-term storage when you need persistence. Implement RAG when you have knowledge bases to query.
The agent that remembers outperforms the agent that forgets. Every time.
Ready to build agents with persistent memory and code execution? Get started with HopX — sandboxes that maintain state across sessions.
Further Reading
- What Is an AI Agent? — Agent fundamentals
- The Planning Pattern — Use memory to track plan execution
- Tool Use Pattern — Tools for memory operations
- Building a Code Interpreter — Stateful code execution
- Pinecone Documentation — Production vector database
- LangChain Memory — Memory abstractions