Conversational memory
Definition
Conversational memory refers to the set of techniques that allow a chat agent to retain and utilize information from previous turns in a dialogue. Unlike retrieval-augmented generation, which pulls in external documents, conversational memory is exclusively concerned with what has already been said between the user and the agent. Getting this right is what separates a frustrating chatbot that asks you to repeat yourself from an agent that feels genuinely attentive.
There are several distinct strategies for managing conversation history, each with different trade-offs between cost, fidelity, and scalability. The simplest approach—keeping every message verbatim—works fine for short conversations but quickly exhausts the model's context window. More sophisticated patterns use summarization or semantic indexing to compress or selectively retrieve the history that matters most for the current turn.
Choosing the right memory pattern depends heavily on the expected conversation length, the importance of exact wording versus semantic meaning, and the cost constraints of the deployment. In practice, production chat agents often combine two or more patterns: a short-term verbatim buffer for immediate coherence and a summary or vector layer for long-horizon recall.
How it works
Buffer memory
Buffer memory is the most straightforward pattern: the agent maintains an ordered list of recent messages and prepends it to every new context window. In the unbounded variant, every message is kept verbatim until the context window fills; the sliding-window variant caps the list at the last N message pairs and drops the oldest pair (FIFO) once capacity is reached. Either way, the agent always has verbatim access to the most recent exchanges, with no transformation or lossy compression, and no extra LLM calls are incurred. Buffer memory is ideal for short to medium conversations where recency is the primary signal. Its main weakness is that older context is silently lost without any summary.
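The eviction behavior can be sketched in a few lines of plain Python; the `BufferMemory` class and `max_pairs` parameter here are illustrative names, not a library API:

```python
from collections import deque


class BufferMemory:
    """FIFO buffer over (user, assistant) message pairs."""

    def __init__(self, max_pairs: int):
        # A deque with maxlen drops the oldest pair automatically (FIFO).
        self.pairs = deque(maxlen=max_pairs)

    def add_turn(self, user_msg: str, assistant_msg: str) -> None:
        self.pairs.append((user_msg, assistant_msg))

    def render(self) -> str:
        """Flatten the buffer into text prepended to the next prompt."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.pairs)


memory = BufferMemory(max_pairs=2)
memory.add_turn("Hi, I'm Alice.", "Hello Alice!")
memory.add_turn("I enjoy hiking.", "Noted — hiking it is.")
memory.add_turn("Any trail tips?", "Start with local loop trails.")

# Only the 2 most recent pairs survive; the first pair was evicted.
print(len(memory.pairs))           # 2
print("Alice" in memory.render())  # False — oldest turn dropped
```

Using `deque(maxlen=...)` keeps the eviction implicit; a framework version differs mainly in how the buffer is serialized into the prompt.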
Summary memory
Summary memory addresses the forgetting problem by using an LLM to periodically generate a running summary of the conversation so far. When the buffer grows too large, the agent condenses it into a compact narrative—capturing key facts, decisions, and sentiment—then discards the raw messages. The summary occupies far fewer tokens than the original turns, making long conversations tractable. The trade-off is a secondary LLM call for each summarization step, which adds latency and cost, and some information is inevitably lost in the compression.
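A minimal sketch of the trigger logic, with a toy `summarize` function standing in for the secondary LLM call (all names here are illustrative, and the threshold would normally be a token budget rather than a message count):

```python
def summarize(messages: list[str], prior_summary: str) -> str:
    """Stand-in for an LLM call: keeps the first clause of each message."""
    condensed = "; ".join(m.split(".")[0] for m in messages)
    return f"{prior_summary} | {condensed}".strip(" |")


class SummaryMemory:
    """Condense the raw buffer into a running summary once it grows too large."""

    def __init__(self, max_buffered: int = 3):
        self.summary = ""            # compact running narrative
        self.buffer: list[str] = []  # recent raw messages
        self.max_buffered = max_buffered

    def add(self, message: str) -> None:
        self.buffer.append(message)
        if len(self.buffer) >= self.max_buffered:
            # Compress and discard the raw turns — this is the lossy step.
            self.summary = summarize(self.buffer, self.summary)
            self.buffer.clear()

    def context(self) -> str:
        return f"Summary: {self.summary}\nRecent: {self.buffer}"


mem = SummaryMemory(max_buffered=3)
mem.add("I'm planning a trip to Japan. Next spring, probably.")
mem.add("I love temples.")
mem.add("And local food.")  # third message triggers summarization

print(mem.summary)  # → "I'm planning a trip to Japan; I love temples; And local food"
print(mem.buffer)   # → []
```

The prior summary is folded into each new one, so the narrative accumulates across compression cycles while the raw buffer stays bounded.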
Vector memory
Vector memory embeds each conversation turn and stores it in a vector database. On each new turn, the most semantically relevant past exchanges are retrieved by similarity search and injected into the context window alongside recent buffer messages. This pattern excels when conversations are very long or when the current question relates to something said many turns ago. Vector memory is the highest-fidelity approach for long-horizon recall but requires embedding infrastructure and introduces retrieval latency.
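The retrieve-by-similarity step can be sketched with a toy bag-of-words "embedding" and cosine similarity; a real deployment would use a learned embedding model and an approximate-nearest-neighbor index instead of this mock:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Mock embedding: bag-of-words term counts (real systems use a model)."""
    return Counter(w.strip(".,?!") for w in text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class VectorMemory:
    """Store embedded turns; retrieve the most similar ones per query."""

    def __init__(self):
        self.store: list[tuple[Counter, str]] = []

    def add(self, turn: str) -> None:
        self.store.append((embed(turn), turn))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.store, key=lambda e: cosine(q, e[0]), reverse=True)
        return [turn for _, turn in ranked[:k]]


mem = VectorMemory()
mem.add("User said their favorite city is Kyoto.")
mem.add("User asked about the weather tomorrow.")
mem.add("User mentioned they are vegetarian.")

# The dietary fact surfaces even though it was not the most recent turn.
print(mem.retrieve("any restaurant tips for a vegetarian?", k=1))
```

The retrieved turns would then be injected into the prompt alongside the recent buffer, which is what lets the agent answer questions about things said many turns ago.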
Entity memory
Entity memory extracts named entities—people, places, products, preferences—from the conversation and maintains a structured record of what the agent knows about each entity. When an entity is mentioned again, its stored profile is injected into the context. Entity memory is ideal for personal assistant use cases where remembering that "Alice prefers morning meetings" or "the project deadline is June 10" is more valuable than remembering the exact wording of past messages.
When to use / When NOT to use
| Use when | Avoid when |
|---|---|
| Conversations span more than a handful of turns | The task is single-turn with no need for history |
| Users expect the agent to remember what they said earlier | Conversation data cannot be stored for privacy or compliance reasons |
| Context window costs are significant and history is long | The conversation is always short enough to fit fully in the context window |
| Users discuss multiple entities or topics across the session | Summarization latency is unacceptable for the use case |
| Cross-session recall is required (vector/entity patterns) | Added infrastructure complexity outweighs the fidelity benefit |
Comparisons
| Criterion | Buffer memory | Summary memory | Vector memory |
|---|---|---|---|
| Cost per turn | Low (no extra LLM call) | Medium (occasional summarizer call) | Medium (embedding call + DB query) |
| Fidelity of recall | Exact but bounded to last N turns | Lossy compression of older turns | High for semantically relevant content |
| Context length handling | Poor — oldest turns silently dropped | Good — summary compresses old turns | Excellent — retrieves only relevant chunks |
| Latency | Minimal | Moderate (summarization adds a step) | Moderate (embedding + nearest-neighbor search) |
| Cross-session recall | No (in-memory buffer) | Possible if summary is persisted | Yes (vector store is persistent) |
| Implementation complexity | Very low | Low–medium | Medium–high |
Code examples
"""
Conversational memory patterns using LangChain.
Demonstrates:
1. ConversationBufferMemory — keep verbatim last N messages
2. ConversationSummaryMemory — compress history into a running summary
3. ConversationBufferWindowMemory — sliding window variant
"""
# pip install langchain langchain-openai openai
from langchain.memory import (
ConversationBufferMemory,
ConversationSummaryMemory,
ConversationBufferWindowMemory,
)
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI
# ---------------------------------------------------------------------------
# 1. Buffer memory — keeps ALL messages (use for short conversations)
# ---------------------------------------------------------------------------
def demo_buffer_memory():
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)
chain = ConversationChain(llm=llm, memory=memory, verbose=False)
reply1 = chain.predict(input="My name is Alice. I enjoy hiking.")
reply2 = chain.predict(input="What outdoor activities would you recommend for me?")
# The second call has access to the first turn verbatim
print("Buffer memory — reply 2:", reply2)
print("History length:", len(memory.chat_memory.messages), "messages\n")
# ---------------------------------------------------------------------------
# 2. Summary memory — LLM compresses history on each turn
# ---------------------------------------------------------------------------
def demo_summary_memory():
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# The same LLM is used to generate summaries; you can use a cheaper model here
memory = ConversationSummaryMemory(llm=llm, return_messages=True)
chain = ConversationChain(llm=llm, memory=memory, verbose=False)
chain.predict(input="I'm planning a trip to Japan next spring.")
chain.predict(input="I'm most interested in traditional temples and local food.")
reply3 = chain.predict(input="Can you suggest a one-week itinerary?")
print("Summary memory — reply 3:", reply3)
# The buffer contains only the latest summary, not all past raw messages
print("Summary:", memory.moving_summary_buffer[:200], "...\n")
# ---------------------------------------------------------------------------
# 3. Window memory — keeps only the last k turns (sliding window)
# ---------------------------------------------------------------------------
def demo_window_memory(k: int = 3):
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# k=3 means only the last 3 HumanMessage+AIMessage pairs are retained
memory = ConversationBufferWindowMemory(k=k, return_messages=True)
chain = ConversationChain(llm=llm, memory=memory, verbose=False)
for i in range(6):
reply = chain.predict(input=f"This is message number {i + 1}.")
print(f"Turn {i + 1}: {reply[:80]}")
print(
f"\nWindow memory keeps {len(memory.chat_memory.messages)} messages "
f"(max {k * 2} for k={k} turn pairs)\n"
)
# ---------------------------------------------------------------------------
# Manual entity-style memory (illustrative, no extra dependency)
# ---------------------------------------------------------------------------
def demo_entity_memory_manual():
"""
Minimal entity memory: parse key facts from each turn and inject them.
In production, use LangChain's ConversationEntityMemory or a dedicated NER model.
"""
entity_store: dict[str, str] = {}
def extract_entities_mock(text: str) -> dict[str, str]:
"""Mock extraction — real impl would call an LLM or NER model."""
entities = {}
if "my name is" in text.lower():
name = text.lower().split("my name is")[-1].strip().split()[0].rstrip(".,")
entities["user_name"] = name.capitalize()
if "deadline" in text.lower():
entities["deadline"] = "mentioned but not parsed in this mock"
return entities
turns = [
("user", "My name is Bob and my project deadline is end of July."),
("user", "Can you help me prioritize my tasks?"),
]
for role, msg in turns:
entity_store.update(extract_entities_mock(msg))
entity_context = "; ".join(f"{k}={v}" for k, v in entity_store.items())
print(f"[{role}] {msg}")
print(f" Entity context injected: {entity_context}\n")
if __name__ == "__main__":
import os
if os.getenv("OPENAI_API_KEY"):
demo_buffer_memory()
demo_summary_memory()
demo_window_memory()
else:
print("Set OPENAI_API_KEY to run LangChain demos.")
demo_entity_memory_manual()
Practical resources
- LangChain Memory Docs — Comprehensive reference for all LangChain memory classes with usage examples.
- Rethinking Memory in Conversational AI (Lilian Weng) — Deep-dive blog post covering memory taxonomy and design trade-offs in agent systems.
- MemoryOS: Memory-based Operating System for LLM Agents — Research on hierarchical memory management inspired by OS design.
- OpenAI Assistants Thread Management — How OpenAI's managed API handles persistent conversation threads.