ContextFit is open source · MIT License · ★ Star on GitHub
research preview · MIT license
99.0% R@5 fusion beats gbrain 97.6% and MemPalace 96.6% · LongMemEval-S n=500

Agent memory that thinks in
tokens, not vectors

ContextFit retrieves the right prior conversation without embedding APIs, vector databases, or GPU hardware — with structure-aware ingestion for MD, TXT, TMD ledger, JSON, CSV, email, calendar, and code files.

$ pip install contextfit
Use it via API · CLI · MCP
View on GitHub · Read the Whitepaper · How it works →
96.2% Token-native R@5 · 99.0% fusion beats gbrain + MemPalace
8.7ms Routed query latency
50× Faster than embed API retrieval
$0 API cost per query
41MB Total dependency footprint

Embedding-based memory has
three structural flaws —
and three costly consequences

Every major agent memory system in 2026 converts conversations to dense vectors before retrieval. This works for factual lookups — and breaks for the queries agents most frequently need to answer.

🌀

Semantic averaging

An entire session compressed into one vector loses episode structure. "What should I cook tonight?" shares almost no vector proximity with "I just harvested zucchini from my garden" — even though that's exactly the session the agent needs.

💸

Compounding API cost

Every memory access requires an embedding call. At real agent throughput — thousands of turns per day — this compounds into a meaningful infrastructure cost line, plus a vector database on top.

🔒

Data leaves the machine

Sending memory retrieval queries to an embedding API means your user's personal context crosses a network boundary on every turn. For privacy-sensitive use cases this is a non-starter.

🔧

Operational complexity

Embedding API + vector database + embedding model versioning + index management = four moving parts to provision, monitor, and keep synchronized.

🫥

Zero interpretability

When retrieval fails, the answer is "cosine distance was 0.62." There's no auditable explanation for why a session was or wasn't surfaced.

GPU dependency

Running embedding models locally requires PyTorch and benefits significantly from GPU acceleration — 500MB–2GB of dependencies before a single session is indexed.


Stay in token space,
end to end

The most valuable signals in conversational memory are structural, not semantic: what kind of memory did the user express? and does this episode's memory type match what this query needs? These questions are answerable with token-level pattern matching — no embedding model required.

PRIMITIVE 01

Memory Atoms

Deterministic, domain-agnostic fact extraction from user-authored turns. Eight typed primitives — preference, goal, constraint, decision, temporal update, open loop, interest, entity — extracted with zero API calls.
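As a rough illustration of what deterministic atom extraction can look like, here is a minimal sketch using regular expressions over user-authored sentences. The patterns, the `Atom` class, and `extract_atoms` are hypothetical and cover only four of the eight atom types; they are not ContextFit's actual rules.

```python
import re
from dataclasses import dataclass

# Hypothetical marker patterns, chosen for illustration only.
ATOM_PATTERNS = {
    "preference": re.compile(r"\bI (?:really )?(?:like|love|prefer|enjoy)\b", re.I),
    "goal": re.compile(r"\bI (?:want|plan|hope|am trying|'m trying) to\b", re.I),
    "constraint": re.compile(r"\bI (?:can't|cannot|am allergic to|'m allergic to)\b", re.I),
    "open_loop": re.compile(r"\bI still need to\b|\bremind me to\b", re.I),
}


@dataclass(frozen=True)
class Atom:
    kind: str
    text: str


def extract_atoms(user_turn: str) -> list[Atom]:
    """Return one atom per (sentence, matching pattern) pair, in sentence order."""
    atoms = []
    for sentence in re.split(r"(?<=[.!?])\s+", user_turn.strip()):
        for kind, pattern in ATOM_PATTERNS.items():
            if pattern.search(sentence):
                atoms.append(Atom(kind, sentence))
    return atoms


atoms = extract_atoms(
    "I love roasted zucchini. I still need to email the landlord. The weather was fine."
)
for atom in atoms:
    print(atom.kind, "->", atom.text)
```

Because the rules are plain patterns, the same input always yields the same atoms, and a missed atom can be traced to the pattern that failed to fire.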

PRIMITIVE 02

Episode Relevance Scorer

Numeric session ranking by structural memory-signal type, not vector proximity. Answers "does this session have what this query type needs?" — the right question for vague advice queries.
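One way to picture type-based scoring: count the atom kinds a session contains and weight them by what the query type needs. The weights, the query-type names, and the cap of three are assumptions made for this sketch, not values from ContextFit.

```python
from collections import Counter

# Hypothetical mapping from query type to the atom kinds that answer it.
NEEDED_KINDS = {
    "vague_advice": {"preference": 1.0, "interest": 0.5, "constraint": 0.5},
    "follow_up": {"open_loop": 1.0, "goal": 0.5},
}


def episode_score(atom_kinds: list[str], query_type: str) -> float:
    """Weighted count of the atom kinds this query type needs."""
    weights = NEEDED_KINDS[query_type]
    counts = Counter(atom_kinds)
    # Cap each kind at 3 so a long session cannot win on volume alone.
    return sum(weight * min(counts[kind], 3) for kind, weight in weights.items())


def rank_sessions(sessions: dict[str, list[str]], query_type: str) -> list[str]:
    """Session IDs ordered from highest to lowest episode score."""
    return sorted(
        sessions,
        key=lambda sid: episode_score(sessions[sid], query_type),
        reverse=True,
    )


sessions = {
    "s_garden_harvest": ["preference", "interest", "entity"],
    "s_tax_filing": ["open_loop", "goal", "entity"],
}
ranking = rank_sessions(sessions, "vague_advice")
```

Note that the query text never enters the score here; only its type does, which is why a vague question can still surface a session with no shared vocabulary.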

PRIMITIVE 03

Query Router

Deterministic zero-cost dispatch to the right retrieval mode per query. Vague advice → episode scorer. Specific facts → BM25. Temporal state → fusion. No LLM routing, no model-call latency.
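A deterministic router can be as small as an ordered list of pattern checks. The sketch below is an assumption about the general shape, with made-up cue words; only the three route targets (episode scorer, BM25, fusion) come from the description above.

```python
import re

# Hypothetical cue patterns; a real router would use a richer rule set.
TEMPORAL = re.compile(r"\b(now|currently|latest|these days|anymore)\b", re.I)
VAGUE_ADVICE = re.compile(r"\b(what should i|any ideas|recommend|suggest)\b", re.I)


def route(query: str) -> str:
    """Pick a retrieval mode from surface cues, with no model call."""
    if TEMPORAL.search(query):
        return "fusion"
    if VAGUE_ADVICE.search(query):
        return "episode_score"
    return "bm25"  # default: treat the query as a specific fact lookup


for q in (
    "what should I cook for dinner tonight?",
    "Which support inbox am I using now?",
    "What is the serial number of my router?",
):
    print(route(q), "<-", q)
```

Rule order matters: temporal cues are checked first so that a query like "what should I use now?" goes to the temporal path rather than the advice path.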

PRIMITIVE 04

Structural Reranker

Ten token-native features re-score BM25 candidates: lexical overlap, behavior-marker alignment, named entity overlap, question-type slot matching. Keeps exact-match paths interpretable and local.

PRIMITIVE 05

Preference Reranker

Router-gated taste retrieval for personalized recommendations. It promotes user-authored preference evidence with marker detection, preference-window overlap, and lightweight token normalization — beating embedding baselines on preference R@1.

PRIMITIVE 06

Evidence-Coverage Reranker

For multi-session synthesis, preserves the strongest anchor and promotes companion sessions with complementary evidence — lifting LongMemEval multi-session All@5 from 55.4% to 65.3% without embeddings.

raw session text
BPE tokenize
atom extraction
BM25 index
query router
structural / preference / coverage rerank
ranked sessions
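For readers unfamiliar with the BM25 step in this pipeline, here is a minimal Okapi BM25 ranking sketch. It uses whitespace-split words as a stand-in for BPE token IDs, and the `k1` and `b` defaults are common textbook values, not ContextFit's settings.

```python
import math
from collections import Counter


def bm25_rank(query_tokens, docs, k1=1.5, b=0.75):
    """Rank document IDs by Okapi BM25 score against the query tokens."""
    n = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / n
    # Document frequency: in how many documents each token appears.
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "s_garden_harvest": "i just harvested zucchini from my garden".split(),
    "s_flight": "my flight to denver was delayed again".split(),
}
ranking = bm25_rank("zucchini garden recipes".split(), docs)
```

The same function works unchanged on lists of integer token IDs, since it only needs tokens to be hashable.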

Lower cost per
correct answer

RAG cost is not just search cost. It is every oversized prompt, every rerank, and every retry needed before the model finally sees the right evidence. ContextFit is designed to retrieve less, better — reducing the context loops between question and correct answer.

cost / correct answer = retrieval + context tokens + rerank + retries
  • 01 Evidence found early. Higher Recall@K and MRR means fewer second-pass searches.
  • 02 Smaller useful context. Send the model the evidence it needs, not a giant safety blanket of maybe-relevant chunks.
  • 03 Handle-based citations. Keep provenance in an expiring sidecar reference map while the model sees tiny @r1 handles.
  • 04 Fewer iteration loops. Better first-pass retrieval cuts query expansion, reranking, and answer regeneration.
  • 05 Lower latency and spend. Fewer tokens and fewer retries compound across every agent turn.
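The cost formula above can be read as an expected-value calculation. The sketch below models retries as full repeat attempts, which is one possible interpretation rather than the page's definition, and every number in it is a made-up input, not a measurement.

```python
def cost_per_correct_answer(retrieval, context_tokens, price_per_1k_tokens, rerank, retry_rate):
    """Expected spend per correct answer when a failed attempt is retried in full."""
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    per_attempt = retrieval + context_tokens / 1000 * price_per_1k_tokens + rerank
    expected_attempts = 1 / (1 - retry_rate)  # mean of a geometric distribution
    return per_attempt * expected_attempts


# Made-up inputs for illustration only; these are not measurements.
stuffed = cost_per_correct_answer(0.0001, 8000, 0.003, 0.001, retry_rate=0.20)
focused = cost_per_correct_answer(0.0, 2000, 0.003, 0.0, retry_rate=0.05)
print(f"stuffed context: {stuffed:.6f}  focused context: {focused:.6f}")
```

Under this reading, context size and retry rate multiply each other, so trimming both has a larger effect than trimming either alone.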
Traditional RAG loop
embed → search → rerank → stuff context → miss → retry larger
ContextFit path
token-native search → precise evidence → @r citations → smaller prompt → answer
fewer irrelevant chunks in context
fewer retrieval and rerank loops
lower LLM input token spend
clean expiring provenance references
faster time to source-backed answer
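The handle-based citation idea can be sketched as a small map from short `@rN` handles to provenance records with a time-to-live. The `ReferenceMap` class, its method names, and the 15-minute default are all hypothetical; the page states only that handles are short and that references expire.

```python
import time


class ReferenceMap:
    """Maps short @rN handles to provenance records that expire after a TTL."""

    def __init__(self, ttl_seconds: float = 900.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}
        self._next_id = 1

    def add(self, source: dict) -> str:
        """Store a provenance record and return the handle the model will see."""
        handle = f"@r{self._next_id}"
        self._next_id += 1
        self._entries[handle] = (self._clock() + self._ttl, source)
        return handle

    def resolve(self, handle: str):
        """Return the record for a handle, or None if unknown or expired."""
        entry = self._entries.get(handle)
        if entry is None:
            return None
        expires_at, source = entry
        if self._clock() >= expires_at:
            del self._entries[handle]
            return None
        return source


refs = ReferenceMap()
handle = refs.add({"session_id": "s_garden_harvest", "lines": (3, 7)})
prompt_line = f"User harvested zucchini last week {handle}"
```

The prompt carries only the few characters of the handle, while the full path, line range, and session ID stay in the sidecar map until the answer needs to be verified.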

Public memory retrieval,
measured like gbrain.

LongMemEval-S session retrieval asks one clean question: did the system put a ground-truth answer session in the top 5? No answer generation, no LLM judge, no hidden scoring model. Answer-quality benchmarks stay separate because they depend on the downstream model and judge.
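The metric described here is simple enough to state in code. This sketch computes hit-based Recall@K and MRR from ranked session IDs and a set of ground-truth answer sessions; the function names and the toy data are illustrative and are not taken from the benchmark harness.

```python
def recall_at_k(ranked, gold, k):
    """1.0 if any ground-truth session appears in the top k, else 0.0."""
    return 1.0 if any(session in gold for session in ranked[:k]) else 0.0


def reciprocal_rank(ranked, gold):
    """1 / rank of the first ground-truth session, or 0.0 if none is retrieved."""
    for rank, session in enumerate(ranked, start=1):
        if session in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """Average both metrics over (ranked_ids, gold_ids) pairs, one pair per question."""
    n = len(runs)
    return {
        f"R@{k}": sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / n,
        "MRR": sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / n,
    }


runs = [
    (["a", "b", "c"], {"b"}),  # hit at rank 2
    (["x", "y", "z"], {"q"}),  # miss
]
metrics = evaluate(runs, k=2)
```

No model or judge appears anywhere in the computation, which is what makes the score reproducible from the ranked lists alone.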

Token-native
96.20% R@5
LongMemEval-S session retrieval across all 500 questions, with no embeddings, no vector DB, and no LLM in retrieval.
Optional fusion
99.00% R@5
Route-gated OpenAI chunk-vector fusion plus certificates; still no vector database and no LLM in the retrieval loop.
Published reference
97.60% gbrain
gbrain-hybrid reports 97.60% R@5, compared against MemPalace raw at 96.6% R@5 on the same LongMemEval-S retrieval metric.
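| System | R@1 | R@3 | R@5 | R@10 | Embeddings | Vector store |
|---|---|---|---|---|---|---|
| gbrain-hybrid (published) | - | - | 97.60% | - | yes | local |
| MemPalace raw (published) | - | - | 96.60% | - | yes | local |
| ContextFit token-native | 81.80% | 90.40% | 96.20% | 97.80% | no | no |
| ContextFit + OpenAI fusion | 84.60% | 95.20% | 99.00% | 99.60% | yes | no |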
LongMemEval-S session retrieval, all 500 questions, top-5 hit against answer_session_ids. ContextFit scores are from local full-run artifacts; gbrain and MemPalace are published reference rows. This is retrieval recall, not answer accuracy. MemPalace raw is the 96.6% R@5 zero-API baseline cited by gbrain-evals and MemPalace benchmark docs. Full run notes: ContextFit comparison artifact.
| Agent-memory eval | R@1 | R@3 | MRR | API cost | GPU |
|---|---|---|---|---|---|
| Mem0 (79-case) | 54.4% | 91.1% | 0.716 | LLM + embed | - |
| Cohere embed-english-v3 | 58.7% | 91.4% | 0.751 | embed API | - |
| ContextFit + routed rerankers | 62.7% | 94.0% | 0.784 | $0 | ✓ CPU |
| OpenAI text-embedding-3-small | 63.1% | 96.6% | 0.792 | embed API | - |
Agent-memory retrieval eval. Mem0 was measured on the original 79-case subset; Cohere, ContextFit, and OpenAI were measured on the expanded 499-case benchmark. ContextFit beats Cohere and Mem0 on aggregate R@1, but still trails OpenAI text-embedding-3-small on aggregate R@1/R@3. LongMemEval evidence and QA results are separate: retrieval artifact and QA artifact.
Recall@1, agent memory eval (Mem0 79-case; others 499-case): Mem0 54.4% · Cohere embed-v3 58.7% · ContextFit 62.7% · OpenAI embed-3-small 63.1%

Strengths where AI agents most need them

The hardest memory failures are rarely simple keyword lookups. They are open loops, preferences, temporal changes, and cross-session synthesis. These are the moments where better retrieval reduces retries, preserves user trust, and turns memory into useful action.

open_loop_retrieval

+16.4 pts
80.3% vs 63.9% — structural markers dominate
Example: “What did I say I still need to follow up on?”

preference_recommendation

+8.1 pts
85.5% vs 77.4% — token-native preference route wins
Example: “What should I recommend based on what they like?”

temporal_supersession

+1.6 pts
49.2% vs 47.6% — certificates help, but routing remains active
Example: “Which support inbox am I using now?”

multi_session_synthesis

−5.4 pts
82.1% vs 87.5% — evidence-coverage route narrows the gap
Example: “Why is this launch at risk across all the notes?”

No database. No GPU.
No vendor lock-in.

The token-native architecture isn't just a performance choice — it's a deployment choice. ContextFit runs anywhere a Python process runs.

✓ eliminated

No vector database

The index is a directory of flat files — zstd-compressed token arrays, BM25 postings, LSH signatures. No Qdrant, no Pinecone, no pgvector. No service to start, no schema to migrate.

✓ eliminated

No GPU required

Zero PyTorch, zero CUDA, zero model weights. Every operation runs on CPU: BM25 scoring, roaring bitmap intersection, MinHash LSH, episode feature computation, structural reranking.

✓ 41MB total

Minimal footprint

Total dependency footprint is ~41MB (tiktoken + numpy + pyroaring + datasketch + zstandard). PyTorch alone is 500MB–2GB. Fits in a Lambda function, a slim container, or a mobile app bundle.

✓ offline-capable

Filesystem-native storage

Standard POSIX permissions. The index lives alongside your data — inside an encrypted vault, a git repo, a synced drive. Back it up with rsync. No DB dump procedures, no export formats.

| Property | ContextFit | Embedding + vector DB |
|---|---|---|
| Database | None | Qdrant / Chroma / pgvector |
| GPU | Not required | Recommended for local models |
| Dependency size | ~41MB | 500MB–2GB+ (PyTorch alone) |
| Storage format | Plain files | DB-managed blobs |
| Permissions | POSIX filesystem | DB users / ACLs |
| Offline capable | Yes (default path) | No (API) / Partial (local model) |
| Backup | Any file backup tool | DB dump + vector store export |
| API cost (default) | $0 | Per-embedding call |
| Query latency | 0.5–9ms in-process | 50–500ms+ (embed + vector search) |

Chunks follow document meaning,
not blind token windows

ContextFit still stores and searches token IDs, but file ingestion now chooses smarter boundaries first: Markdown headings, prose paragraphs, TMD ledger rows, JSON events, CSV/TSV records, email messages, calendar events, and source code symbols. The result is more auditable evidence with richer per-chunk metadata.

✓ Markdown

Heading-aware sections

Heading path, section level, and ordinal metadata travel with each chunk. Paragraphs, lists, tables, blockquotes, and code fences stay together where possible.
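To make heading-aware chunking concrete, here is a minimal sketch that splits Markdown at ATX headings, carries the heading path with each chunk, and leaves fenced code intact. `chunk_markdown` and its output keys are hypothetical names, and the sketch ignores setext headings and nested fences.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # built programmatically so this example can itself sit in a fence


def chunk_markdown(text: str) -> list[dict]:
    chunks, stack, body = [], [], []  # stack holds (level, title) pairs
    in_fence = False

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({
                "heading_path": [title for _, title in stack],
                "level": stack[-1][0] if stack else 0,
                "ordinal": len(chunks),
                "text": content,
            })
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


doc = "\n".join([
    "# Trip", "Pack light.",
    "## Food", FENCE, "# not a heading", FENCE, "Buy snacks.",
])
chunks = chunk_markdown(doc)
```

The `#` line inside the fence stays in the body of the "Food" section rather than starting a new one, which is the behavior "code fences stay together" implies.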

✓ Plain text

Paragraph-aware grouping

Text files are grouped by natural paragraphs and separators, with overlap by whole paragraph rather than arbitrary token tails.
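Overlap by whole paragraph can be sketched in a few lines. This version measures size in words as a stand-in for tokens, and `chunk_paragraphs`, `max_words`, and `overlap_paragraphs` are illustrative names rather than ContextFit parameters.

```python
import re


def chunk_paragraphs(text, max_words=120, overlap_paragraphs=1):
    """Group paragraphs into chunks, repeating whole trailing paragraphs as overlap."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        size = sum(len(p.split()) for p in current)
        if current and size + len(para.split()) > max_words:
            chunks.append("\n\n".join(current))
            # Carry over complete paragraphs, never a partial tail.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "a b c\n\nd e f\n\ng h i"
chunks = chunk_paragraphs(text, max_words=6, overlap_paragraphs=1)
```

With the toy input, the middle paragraph appears at the end of the first chunk and the start of the second, so each chunk still reads as complete prose.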

✓ TMD ledger

Row-aware retrieval

TMD ledger files chunk by rows while preserving schema and front-matter context, making source rows easier to cite and verify.

✓ JSON / JSONL

Object-aware records

API exports, event logs, and chat streams chunk by object or line while retaining stable path, line, and index metadata.
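For line-delimited JSON, object-aware chunking amounts to one chunk per record with its position attached. The sketch below assumes JSONL input; `chunk_jsonl` and the metadata keys are hypothetical.

```python
import json


def chunk_jsonl(text: str, path: str) -> list[dict]:
    """One chunk per JSON line, keeping the source path, line number, and index."""
    chunks = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped but still count toward line numbers
        record = json.loads(line)
        chunks.append({
            "path": path,
            "line": line_no,
            "index": len(chunks),
            "text": json.dumps(record, sort_keys=True, ensure_ascii=False),
        })
    return chunks


log = '{"event": "login", "user": "ana"}\n\n{"user": "ana", "event": "logout"}\n'
chunks = chunk_jsonl(log, path="events.jsonl")
```

Serializing with `sort_keys=True` keeps the chunk text stable even when the producer changes key order between records.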

✓ CSV / TSV

Header-preserving rows

Tabular exports chunk by row with header fields preserved in the source text, keeping ledgers, inventories, and reports easy to cite.
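Header-preserving row chunks can be produced with the standard `csv` module alone. In this sketch each row is rendered as `field: value` pairs so the header travels with the data; the rendering format and the `chunk_csv` name are assumptions.

```python
import csv
import io


def chunk_csv(text: str, delimiter: str = ",") -> list[dict]:
    """One chunk per data row, with header names written into the chunk text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    chunks = []
    for row_index, row in enumerate(reader):
        rendered = "; ".join(f"{field}: {value}" for field, value in row.items())
        # reader.line_num is the last source line consumed for this row.
        chunks.append({"row": row_index, "line": reader.line_num, "text": rendered})
    return chunks


chunks = chunk_csv("item,price\nlamp,40\ndesk,220\n")
tsv_chunks = chunk_csv("item\tprice\nlamp\t40\n", delimiter="\t")
```

A retrieved row such as `item: desk; price: 220` is readable on its own, without the reader needing to look up which column was which.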

✓ Email

Message-aware chunks

Email files keep sender, recipient, subject, and date context attached to message body chunks for source-verifiable inbox memory.
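A message-aware chunker can lean on Python's standard `email` package. The sketch below handles a single plain-text message and attaches the four headers named above to every body paragraph; `chunk_email` and its output shape are hypothetical.

```python
from email import message_from_string
from email.policy import default


def chunk_email(raw: str) -> list[dict]:
    """Split a message body by paragraph, attaching key headers to every chunk."""
    msg = message_from_string(raw, policy=default)
    headers = {name: str(msg[name] or "") for name in ("From", "To", "Subject", "Date")}
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    return [{"headers": headers, "part": i, "text": p} for i, p in enumerate(paragraphs)]


raw = (
    "From: ana@example.com\n"
    "To: bo@example.com\n"
    "Subject: Lease renewal\n"
    "Date: Mon, 05 Jan 2026 09:00:00 +0000\n"
    "\n"
    "Please sign the lease by Friday.\n"
    "\n"
    "The new rent starts in March.\n"
)
chunks = chunk_email(raw)
```

Even the second paragraph, which never mentions a lease, is retrievable with its subject line and sender attached.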

✓ Calendar

Event-aware chunks

Calendar files chunk by event with summary, time, location, recurrence, organizer, and attendee metadata preserved.

✓ Code

Symbol-aware chunks

Common source files chunk around imports, classes, functions, and selectors with language, symbol, and line-range metadata preserved.
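For Python source specifically, symbol-aware chunking can be sketched with the standard `ast` module. This covers only top-level imports, functions, and classes in one language; the `chunk_python_source` name and metadata keys are illustrative, and other languages would need their own parsers.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """One chunk per top-level import, function, or class, with line ranges."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            kind, name = "import", None
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind, name = "function", node.name
        elif isinstance(node, ast.ClassDef):
            kind, name = "class", node.name
        else:
            continue  # other top-level statements are skipped in this sketch
        chunks.append({
            "language": "python",
            "kind": kind,
            "symbol": name,
            "lines": (node.lineno, node.end_lineno),
            "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
        })
    return chunks


source = "import os\n\ndef greet(name):\n    return 'hi ' + name\n\nclass Box:\n    pass\n"
chunks = chunk_python_source(source)
```

Each chunk boundary falls on a syntactic unit, so a retrieved function is always complete rather than cut mid-body.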


A tiny ledger format agents can actually trust

TMD ledger is a new ContextFit-proposed Tabular Markdown file format: a human-readable text file for records that should stay row-addressable, schema-aware, and easy to cite. It bridges Markdown notes and structured data without requiring a database.

plain text

Readable without tooling

Ledgers stay in normal files, work in git, and remain understandable to humans reviewing purchases, assets, tasks, contacts, or memory records.

row-native

Records stay atomic

Each row can be retrieved, cited, and audited as a source record instead of disappearing into an arbitrary token window.

agent-friendly

Schema travels with evidence

Front-matter and column context stay attached to chunks, so agents know what a row means without needing a separate database schema.


Simple, composable,
fully interpretable

Drop in as a memory layer. Query with a natural-language string. Get back ranked session IDs with source-linked evidence.

memory_retrieval.py
# Ingest sessions: no API calls, no GPU, ~4ms per session
# `sessions` is your own collection of conversation sessions, loaded elsewhere
from contextfit import RetrievalEngine

engine = RetrievalEngine()
engine.ingest_sessions(sessions)
engine.save("./memory_index")

# Query — auto-routes to the right retrieval mode
result = engine.query_auto(
    "what should I cook for dinner tonight?",
    top_k=5
)

# Returns ranked session IDs + route metadata
print(result["route"])        # → episode_score
print(result["session_ids"])  # → ["s_garden_harvest", ...]

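# Or use individual modes directly.
# bm25_order and session_texts stand for a BM25 candidate ordering and the
# matching session texts from your own pipeline; they are not defined here.
query = "what should I cook for dinner tonight?"
engine.rank_sessions_by_episode_score(query, top_k=10)
engine.query(query, method="hybrid", top_k=50)
engine.rerank_sessions_by_structure(query, bm25_order, session_texts)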

Built in the open.
Free to use and extend.

MIT licensed. No cloud dependency. No vendor lock-in. Fork it, embed it, ship it.

INSTALL

pip install contextfit

REQUIRES

Python 3.10+ · CPU only · ~41MB deps · No DB

LICENSE

MIT — use freely in commercial and open-source projects

CONTRIBUTE

Issues, PRs, and benchmark contributions welcome


Research

Token-Native Agent Memory

The full technical whitepaper: architecture, seven primitives, 499-case benchmark methodology, feature ablation, per-behavior analysis, and deployment architecture.

Read the Whitepaper · View on GitHub · LLM-readable Markdown