ContextFit — Token-Native Agent Memory

The Problem

Embedding-based memory has
three structural flaws —
and three costly consequences

Every major agent memory system in 2026 converts conversations to dense vectors before retrieval. This works for factual lookups — and breaks for the queries agents most frequently need to answer.

🌀

Semantic averaging

An entire session compressed into one vector loses its episode structure. "What should I cook tonight?" lands nowhere near "I just harvested zucchini from my garden" in embedding space, even though that is exactly the session the agent needs.

💸

Compounding API cost

Every memory access requires an embedding call. At real agent throughput — thousands of turns per day — this compounds into a meaningful infrastructure cost line, plus a vector database on top.

🔒

Data leaves the machine

Sending memory retrieval queries to an embedding API means your user's personal context crosses a network boundary on every turn. For privacy-sensitive use cases this is a non-starter.

🔧

Operational complexity

Embedding API + vector database + embedding model versioning + index management = four moving parts to provision, monitor, and keep synchronized.

🫥

Zero interpretability

When retrieval fails, the answer is "cosine distance was 0.62." There's no auditable explanation for why a session was or wasn't surfaced.

⚡

GPU dependency

Running embedding models locally requires PyTorch and benefits significantly from GPU acceleration — 500MB–2GB of dependencies before a single session is indexed.

First Principles

Stay in token space,
end to end

The most valuable signals in conversational memory are structural, not semantic: "What kind of memory did the user express?" and "Does this episode's memory type match what this query needs?" Both questions can be answered with token-level pattern matching, with no embedding model required.

PRIMITIVE 01

Memory Atoms

Deterministic, domain-agnostic fact extraction from user-authored turns. Eight typed primitives — preference, goal, constraint, decision, temporal update, open loop, interest, entity — extracted with zero API calls.
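As a rough illustration of deterministic atom extraction, here is a minimal sketch covering three of the eight types. The regexes, the `Atom` class, and `extract_atoms` are hypothetical names chosen for this example; ContextFit's actual extraction rules are not shown on this page.

```python
import re
from dataclasses import dataclass

# Hypothetical marker patterns for three atom types (illustrative assumptions only).
ATOM_PATTERNS = {
    "preference": re.compile(r"\bI (?:really )?(?:love|like|prefer|enjoy|hate|dislike)\b", re.I),
    "goal": re.compile(r"\bI (?:want|plan|hope|am trying|need) to\b", re.I),
    "open_loop": re.compile(r"\b(?:still need to|remind me to|follow up|haven't yet)\b", re.I),
}

@dataclass(frozen=True)
class Atom:
    kind: str   # atom type, e.g. "preference"
    text: str   # the sentence the atom was found in

def extract_atoms(user_turn: str) -> list[Atom]:
    """Return one atom per (sentence, matching type), in sentence order."""
    atoms = []
    for sentence in re.split(r"(?<=[.!?])\s+", user_turn.strip()):
        for kind, pattern in ATOM_PATTERNS.items():
            if pattern.search(sentence):
                atoms.append(Atom(kind, sentence))
    return atoms
```

Because the rules are plain patterns, the same turn always yields the same atoms, and each atom points back to the sentence that produced it.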

PRIMITIVE 02

Episode Relevance Scorer

Numeric session ranking by structural memory-signal type, not vector proximity. Answers "does this session have what this query type needs?" — the right question for vague advice queries.
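One way to picture structural scoring: count the atom types each session contains, then weight the types a given query type needs. The `NEEDS` table, its weights, and both function names below are assumptions made for this sketch, not ContextFit's real scoring model.

```python
# Hypothetical mapping from query type to the atom types it needs, with weights.
NEEDS = {
    "vague_advice": {"preference": 2.0, "interest": 1.0, "constraint": 1.0},
    "follow_up": {"open_loop": 2.0, "goal": 1.0},
}

def episode_score(atom_counts: dict[str, int], query_type: str) -> float:
    """Weighted count of the atom types this query type needs."""
    weights = NEEDS.get(query_type, {})
    return sum(w * atom_counts.get(kind, 0) for kind, w in weights.items())

def rank_sessions(sessions: dict[str, dict[str, int]], query_type: str, top_k: int = 5) -> list[str]:
    """Rank session ids by score, breaking ties by id so output is deterministic."""
    ordered = sorted(sessions, key=lambda sid: (-episode_score(sessions[sid], query_type), sid))
    return ordered[:top_k]
```

Note that the query text never has to overlap with the session text: a session ranks highly because it holds the right kind of memory.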

PRIMITIVE 03

Query Router

Deterministic zero-cost dispatch to the right retrieval mode per query. Vague advice → episode scorer. Specific facts → BM25. Temporal state → fusion. No LLM routing, no latency.
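A deterministic router can be as small as an ordered list of pattern checks. The cue patterns below are invented for illustration, and the route labels simply mirror the three modes named above; the real routing rules may differ.

```python
import re

# Hypothetical cue patterns; the actual routing rules are not shown on this page.
TEMPORAL = re.compile(r"\b(?:now|currently|latest|these days|anymore)\b", re.I)
VAGUE_ADVICE = re.compile(r"\b(?:what should i|any (?:ideas|suggestions)|recommend)\b", re.I)

def route(query: str) -> str:
    """Pick a retrieval mode with ordered, deterministic rules."""
    if TEMPORAL.search(query):
        return "fusion"          # temporal state
    if VAGUE_ADVICE.search(query):
        return "episode_score"   # vague advice
    return "bm25"                # specific facts (default)
```

Rule order is the only tie-breaker, so the same query always takes the same path and the decision can be logged alongside the result.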

PRIMITIVE 04

Structural Reranker

Ten token-native features re-score BM25 candidates: lexical overlap, behavior-marker alignment, named entity overlap, question-type slot matching. Keeps exact-match paths interpretable and local.
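To make the idea concrete, here is a sketch with just two features, lexical overlap and a simple question-type slot check. The feature definitions, the `WEIGHTS` values, and the function names are assumptions for this example and cover only a fraction of the ten features mentioned above.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def features(query: str, session_text: str) -> dict[str, float]:
    q, s = tokens(query), tokens(session_text)
    overlap = len(q & s) / len(q) if q else 0.0
    # Slot check: a "how many / how much" question wants a session containing a number.
    wants_number = bool(re.search(r"\bhow (?:many|much)\b", query, re.I))
    has_number = bool(re.search(r"\d", session_text))
    return {"lexical_overlap": overlap, "slot_match": 1.0 if wants_number and has_number else 0.0}

WEIGHTS = {"lexical_overlap": 1.0, "slot_match": 0.5}  # hypothetical weights

def rerank(query: str, candidates: list[tuple[str, str]]) -> list[tuple[str, float, dict]]:
    """candidates: (session_id, text) in first-pass order. Returns (id, score, features)."""
    scored = []
    for sid, text in candidates:
        f = features(query, text)
        scored.append((sid, sum(WEIGHTS[name] * value for name, value in f.items()), f))
    # sorted() is stable, so ties keep their first-pass (BM25) order.
    return sorted(scored, key=lambda row: -row[1])
```

Returning the feature dictionary with each score is what keeps the path interpretable: every rank change can be traced to a named feature.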

PRIMITIVE 05

Preference Reranker

Router-gated taste retrieval for personalized recommendations. It promotes user-authored preference evidence with marker detection, preference-window overlap, and lightweight token normalization — beating embedding baselines on preference R@1.

PRIMITIVE 06

Evidence-Coverage Reranker

For multi-session synthesis, preserves the strongest anchor and promotes companion sessions with complementary evidence — lifting LongMemEval multi-session All@5 from 55.4% to 65.3% without embeddings.
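The anchor-plus-companions idea resembles a greedy coverage selection. The sketch below is one plausible reading of it, assuming sessions and queries are reduced to term sets; `coverage_rerank` and its inputs are hypothetical and not taken from the ContextFit codebase.

```python
def coverage_rerank(query_terms: set[str],
                    ranked: list[tuple[str, set[str]]],
                    top_k: int = 5) -> list[str]:
    """ranked: (session_id, session_terms) in first-pass order."""
    if not ranked:
        return []
    anchor_id, anchor_terms = ranked[0]      # the strongest anchor always stays first
    chosen = [anchor_id]
    covered = query_terms & anchor_terms
    remaining = list(ranked[1:])
    while remaining and len(chosen) < top_k:
        # Pick the session adding the most query terms not yet covered.
        # max() returns the first maximal item, so ties keep first-pass order.
        best = max(remaining, key=lambda r: len((query_terms & r[1]) - covered))
        chosen.append(best[0])
        covered |= query_terms & best[1]
        remaining.remove(best)
    return chosen
```

In this sketch a near-duplicate of the anchor adds no new coverage, so sessions carrying complementary evidence are promoted ahead of it.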

raw session text → BPE tokenize → atom extraction → BM25 index → query router → structural / preference / coverage rerank → ranked sessions
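For readers unfamiliar with the BM25 stage of this pipeline, here is a minimal, self-contained scorer. It uses whitespace tokens as a stand-in for BPE token IDs (any hashable token works), and the parameter defaults `k1=1.5` and `b=0.75` are common textbook values rather than ContextFit's settings.

```python
import math
from collections import Counter

def bm25_rank(query_tokens: list, docs: dict[str, list], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """docs: {doc_id: [token, ...]} (non-empty). Returns doc ids, best first."""
    n = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n
    # Document frequency: in how many docs does each token appear?
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for q in query_tokens:
            if q not in tf:
                continue
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            norm = tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(toks) / avg_len))
            score += idf * norm
        scores[doc_id] = score
    # Ties are broken by doc id so the ranking is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Everything here is integer counting and a little arithmetic, which is why this stage needs neither a model nor a GPU.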

Context Economics

Lower cost per
correct answer

RAG cost is not just search cost. It is every oversized prompt, every rerank, and every retry needed before the model finally sees the right evidence. ContextFit is designed to retrieve less, better — reducing the context loops between question and correct answer.

What to measure

cost / correct answer = retrieval + context tokens + rerank + retries
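The formula is easy to track per evaluation run. The sketch below assumes each cost component is recorded in the same currency unit; the `RunCosts` fields and the example figures are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class RunCosts:
    retrieval: float        # spend on search calls
    context_tokens: float   # spend on LLM input tokens
    rerank: float           # spend on reranking passes
    retries: float          # spend on second-pass searches and regenerations
    correct_answers: int    # answers judged correct in this run

def cost_per_correct_answer(run: RunCosts) -> float:
    if run.correct_answers == 0:
        raise ValueError("no correct answers; the metric is undefined")
    total = run.retrieval + run.context_tokens + run.rerank + run.retries
    return total / run.correct_answers

# Illustrative numbers only: 10.0 total spend over 80 correct answers.
example = RunCosts(retrieval=1.0, context_tokens=6.0, rerank=1.0, retries=2.0, correct_answers=80)
print(cost_per_correct_answer(example))
```

Dividing by correct answers rather than by queries is the point of the metric: a cheap pipeline that answers wrongly still scores badly.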

01 Evidence found early. Higher Recall@K and MRR mean fewer second-pass searches.
02 Smaller useful context. Send the model the evidence it needs, not a giant safety blanket of maybe-relevant chunks.
03 Handle-based citations. Keep provenance in an expiring sidecar reference map while the model sees tiny @r1 handles.
04 Fewer iteration loops. Better first-pass retrieval cuts query expansion, reranking, and answer regeneration.
05 Lower latency and spend. Fewer tokens and fewer retries compound across every agent turn.
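The handle-based citation idea in item 03 can be sketched as a small sidecar map with a time-to-live. The `ReferenceMap` class, its method names, and the 15-minute default are assumptions for this example; only the `@r1` handle style comes from the text above.

```python
import time

class ReferenceMap:
    """Sidecar map from short handles (@r1, @r2, ...) to source locations, with a TTL."""

    def __init__(self, ttl_seconds: float = 900.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, str, int]] = {}

    def add(self, path: str, line: int) -> str:
        """Register a source location and return the handle the model will see."""
        handle = f"@r{len(self._entries) + 1}"
        self._entries[handle] = (self._clock() + self._ttl, path, line)
        return handle

    def resolve(self, handle: str):
        """Return (path, line), or None if the handle is unknown or has expired."""
        entry = self._entries.get(handle)
        if entry is None or self._clock() >= entry[0]:
            return None
        return entry[1], entry[2]
```

The prompt carries only a few characters per citation, while the full provenance stays outside the context window and lapses on its own.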

Traditional RAG loop

embed → search → rerank → stuff context → miss → retry larger

ContextFit path

token-native search → precise evidence → @r citations → smaller prompt → answer

fewer irrelevant chunks in context

fewer retrieval and rerank loops

lower LLM input token spend

clean expiring provenance references

faster time to source-backed answer

Benchmarks

Public memory retrieval,
measured like gbrain.

LongMemEval-S session retrieval asks one clean question: did the system put a ground-truth answer session in the top 5? No answer generation, no LLM judge, no hidden scoring model. Answer-quality benchmarks stay separate because they depend on the downstream model and judge.

Token-native

96.20% R@5

LongMemEval-S session retrieval across all 500 questions, with no embeddings, no vector DB, and no LLM in retrieval.

Optional fusion

99.00% R@5

Route-gated OpenAI chunk-vector fusion plus certificates; still no vector database and no LLM in the retrieval loop.

Published reference

97.60% gbrain

gbrain-hybrid reports 97.60% R@5, compared against MemPalace raw at 96.6% R@5 on the same LongMemEval-S retrieval metric.

System	R@1	R@3	R@5	R@10	Embeddings	Vector store
gbrain-hybrid published	—	—	97.60%	—	yes	local
MemPalace raw published	—	—	96.60%	—	yes	local
ContextFit token-native	81.80%	90.40%	96.20%	97.80%	no	no
ContextFit + OpenAI fusion	84.60%	95.20%	99.00%	99.60%	yes	no

LongMemEval-S session retrieval, all 500 questions, top-5 hit against answer_session_ids. ContextFit scores are from local full-run artifacts; gbrain and MemPalace are published reference rows. This is retrieval recall, not answer accuracy. MemPalace raw is the 96.6% R@5 zero-API baseline cited by gbrain-evals and MemPalace benchmark docs. Full run notes: ContextFit comparison artifact.
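For reference, the hit-based recall described here, along with MRR, can be computed in a few lines. This is a generic sketch of the metrics, not the evaluation harness used for the numbers above; the function names and the `runs` input shape are assumptions.

```python
def recall_at_k(ranked: list[str], answer_ids: set[str], k: int) -> float:
    """1.0 if any ground-truth session appears in the top k, else 0.0."""
    return 1.0 if answer_ids & set(ranked[:k]) else 0.0

def reciprocal_rank(ranked: list[str], answer_ids: set[str]) -> float:
    """1 / rank of the first ground-truth session, or 0.0 if none was retrieved."""
    for rank, sid in enumerate(ranked, start=1):
        if sid in answer_ids:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """runs: one (ranked_session_ids, answer_session_ids) pair per question."""
    n = len(runs)
    return {
        f"R@{k}": sum(recall_at_k(r, a, k) for r, a in runs) / n,
        "MRR": sum(reciprocal_rank(r, a) for r, a in runs) / n,
    }
```

No model or judge is involved at any step, which is what makes a retrieval score like this reproducible from the ranked lists alone.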

What ContextFit Adds

Agent memory is not just raw Recall@1. Embeddings are a useful signal, but production agents need routing, source aggregation, temporal state, citations, abstention, and local deployment controls. ContextFit is built around those agent behaviors first.

Capability	ContextFit	Raw embedding search	LLM-extracted memories
Temporal updates	Evidence certificates + routed rescue	Similarity only	Depends on extraction quality
Multi-session synthesis	Complementary source-set coverage	Top-k nearest chunks	Often loses episode context
Provenance	Expiring `@r` handles to source lines	Chunk IDs need app wiring	Derived facts need tracebacks
Abstention	Answerability and evidence gates	Returns nearest neighbor anyway	Can preserve stale facts
Privacy and deployment	Local files, CPU, no vector DB required	Usually API + vector store	Usually LLM API + embeddings

The internal 499-case agent-memory eval is still useful for engineering, but it is not a clean homepage leaderboard: Mem0 was measured on a 79-case subset, and aggregate embedding recall does not explain ContextFit's product advantage. Detailed QA and retrieval artifacts remain available in the research notes.

Per-Behavior · agent-memory eval

Strengths where AI agents most need them. The hardest memory failures are rarely simple keyword lookups — they are open loops, preferences, temporal changes, and cross-session synthesis. These are the moments where better retrieval reduces retries, preserves user trust, and turns memory into useful action.

open_loop_retrieval

+16.4 pts

80.3% vs 63.9% — structural markers dominate

Example: “What did I say I still need to follow up on?”

preference_recommendation

+8.1 pts

85.5% vs 77.4% — token-native preference route wins

Example: “What should I recommend based on what they like?”

temporal_supersession

+1.6 pts

49.2% vs 47.6% — certificates help, but routing remains active

Example: “Which support inbox am I using now?”

multi_session_synthesis

−5.4 pts

82.1% vs 87.5% — evidence-coverage route narrows the gap

Example: “Why is this launch at risk across all the notes?”

Deployment Architecture

No database. No GPU.
No vendor lock-in.

The token-native architecture isn't just a performance choice — it's a deployment choice. ContextFit runs anywhere a Python process runs.

✓ eliminated

No vector database

The index is a directory of flat files — zstd-compressed token arrays, BM25 postings, LSH signatures. No Qdrant, no Pinecone, no pgvector. No service to start, no schema to migrate.
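A flat-file token store needs very little code. The sketch below is an illustration only: it uses `zlib` from the standard library as a stand-in for zstd so it runs without third-party packages, and the file naming and on-disk layout are assumptions, not ContextFit's actual index format.

```python
import zlib
from array import array
from pathlib import Path

def save_tokens(index_dir: Path, session_id: str, token_ids: list[int]) -> Path:
    """Write one session's token ids as a compressed flat file."""
    index_dir.mkdir(parents=True, exist_ok=True)
    path = index_dir / f"{session_id}.tok.z"   # hypothetical naming scheme
    # "L" = unsigned long in native byte order, so files are not portable across architectures.
    raw = array("L", token_ids).tobytes()
    path.write_bytes(zlib.compress(raw))
    return path

def load_tokens(path: Path) -> list[int]:
    """Read a session's token ids back from disk."""
    arr = array("L")
    arr.frombytes(zlib.decompress(path.read_bytes()))
    return arr.tolist()
```

Because each session is an ordinary file, copying, encrypting, or deleting a memory is a filesystem operation rather than a database migration.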

✓ eliminated

No GPU required

Zero PyTorch, zero CUDA, zero model weights. Every operation runs on CPU: BM25 scoring, roaring bitmap intersection, MinHash LSH, episode feature computation, structural reranking.

✓ 41MB total

Minimal footprint

Total dependency footprint is ~41MB (tiktoken + numpy + pyroaring + datasketch + zstandard). PyTorch alone is 500MB–2GB. Fits in a Lambda function, a slim container, or a mobile app bundle.

✓ offline-capable

Filesystem-native storage

Standard POSIX permissions. The index lives alongside your data — inside an encrypted vault, a git repo, a synced drive. Back it up with rsync. No DB dump procedures, no export formats.

Property	ContextFit	Embedding + vector DB
Database	None	Qdrant / Chroma / pgvector
GPU	Not required	Recommended for local models
Dependency size	~41MB	500MB–2GB+ (PyTorch alone)
Storage format	Plain files	DB-managed blobs
Permissions	POSIX filesystem	DB users / ACLs
Offline capable	Yes (default path)	No (API) / Partial (local model)
Backup	Any file backup tool	DB dump + vector store export
API cost (default)	$0	Per-embedding call
Query latency	0.5–9ms in-process	50–500ms+ (embed + vector search)

Structure-Aware Ingestion

Chunks follow document meaning,
not blind token windows

ContextFit still stores and searches token IDs, but file ingestion now chooses smarter boundaries first: Markdown headings, prose paragraphs, TMD ledger rows, JSON events, CSV/TSV records, email messages, calendar events, and source code symbols. The result is more auditable evidence with richer per-chunk metadata.

✓ Markdown

Heading-aware sections

Heading path, section level, and ordinal metadata travel with each chunk. Paragraphs, lists, tables, blockquotes, and code fences stay together where possible.
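A heading-aware splitter along these lines might look like the sketch below. It handles only ATX headings and fenced code, and the chunk dictionary keys are assumptions chosen to mirror the metadata named above, not ContextFit's actual schema.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, built indirectly to keep this listing readable

def chunk_markdown(text: str) -> list[dict]:
    """Split Markdown at headings; each chunk carries its heading path."""
    chunks: list[dict] = []
    stack: list[tuple[int, str]] = []   # (level, title) of the currently open sections
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({
                "heading_path": [title for _, title in stack],
                "level": stack[-1][0] if stack else 0,
                "ordinal": len(chunks),
                "text": content,
            })
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Code fences stay inside the section that contains them, so a `#` comment in a listing never starts a new chunk.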

✓ Plain text

Paragraph-aware grouping

Text files are grouped by natural paragraphs and separators, with overlap by whole paragraph rather than arbitrary token tails.
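Whole-paragraph overlap can be sketched as follows. Word counts stand in for token counts here, and the `max_words` budget and function name are assumptions for the example.

```python
import re

def chunk_paragraphs(text: str, max_words: int = 120, overlap_paragraphs: int = 1) -> list[str]:
    """Group paragraphs into chunks, carrying whole paragraphs over as overlap."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        size = sum(len(p.split()) for p in current)
        if current and size + len(para.split()) > max_words:
            chunks.append("\n\n".join(current))
            # Overlap is the last N complete paragraphs, never a partial tail.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than the budget is kept intact in this sketch rather than split mid-thought.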

✓ TMD ledger

Row-aware retrieval

TMD ledger files chunk by rows while preserving schema and front-matter context, making source rows easier to cite and verify.

✓ JSON / JSONL

Object-aware records

API exports, event logs, and chat streams chunk by object or line while retaining stable path, line, and index metadata.
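For JSONL, one chunk per line is a natural fit. The sketch below assumes one JSON object per non-blank line; the chunk keys are illustrative rather than ContextFit's actual metadata schema.

```python
import json

def chunk_jsonl(text: str, source: str) -> list[dict]:
    """One chunk per non-blank line, with the source path, line number, and record index."""
    chunks = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        chunks.append({
            "source": source,
            "line": line_no,          # 1-based line in the original file
            "index": len(chunks),     # 0-based record index, skipping blank lines
            "text": json.dumps(record, sort_keys=True),  # stable key order for matching
        })
    return chunks
```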

✓ CSV / TSV

Header-preserving rows

Tabular exports chunk by row with header fields preserved in the source text, keeping ledgers, inventories, and reports easy to cite.
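Header-preserving row chunks can be produced with the standard `csv` module. In this sketch each row is rendered as `field: value` pairs so the header travels with the evidence; the output shape is an assumption, not ContextFit's actual chunk format.

```python
import csv
import io

def chunk_csv(text: str, source: str, delimiter: str = ",") -> list[dict]:
    """One chunk per data row, with header names repeated in the chunk text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    chunks = []
    for row in reader:
        body = "; ".join(f"{field}: {value}" for field, value in row.items())
        # reader.line_num is the last physical line read, i.e. this row's line
        # when no field contains an embedded newline.
        chunks.append({"source": source, "line": reader.line_num, "text": body})
    return chunks
```

Passing `delimiter="\t"` covers TSV exports with the same code.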

✓ Email

Message-aware chunks

Email files keep sender, recipient, subject, and date context attached to message body chunks for source-verifiable inbox memory.

✓ Calendar

Event-aware chunks

Calendar files chunk by event with summary, time, location, recurrence, organizer, and attendee metadata preserved.

✓ Code

Symbol-aware chunks

Common source files chunk around imports, classes, functions, and selectors with language, symbol, and line-range metadata preserved.
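For Python sources specifically, the standard `ast` module already exposes symbol names and line ranges. The sketch below covers top-level functions and classes only and is an assumption about how such chunking could work, not a description of ContextFit's multi-language implementation.

```python
import ast

def chunk_python_symbols(source: str) -> list[dict]:
    """One chunk per top-level function or class, with its name and line range."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "language": "python",
                "symbol": node.name,
                "lines": (node.lineno, node.end_lineno),  # 1-based, inclusive
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks
```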

TMD Ledger

A tiny ledger format agents can actually trust

TMD ledger is a new ContextFit-proposed Tabular Markdown file format: a human-readable text file for records that should stay row-addressable, schema-aware, and easy to cite. It bridges Markdown notes and structured data without requiring a database.

plain text

Readable without tooling

Ledgers stay in normal files, work in git, and remain understandable to humans reviewing purchases, assets, tasks, contacts, or memory records.

row-native

Records stay atomic

Each row can be retrieved, cited, and audited as a source record instead of disappearing into an arbitrary token window.

agent-friendly

Schema travels with evidence

Front-matter and column context stay attached to chunks, so agents know what a row means without needing a separate database schema.

API

Simple, composable,
fully interpretable

Drop in as a memory layer. Query with a natural-language string. Get back ranked session IDs with source-linked evidence.

memory_retrieval.py

# Ingest sessions — no API calls, no GPU, ~4ms per session
from contextfit import RetrievalEngine

engine = RetrievalEngine()
engine.ingest_sessions(sessions)
engine.save("./memory_index")

# Query — auto-routes to the right retrieval mode
result = engine.query_auto(
    "what should I cook for dinner tonight?",
    top_k=5
)

# Returns ranked session IDs + route metadata
print(result["route"])        # → episode_score
print(result["session_ids"])  # → ["s_garden_harvest", ...]

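# Or use individual modes directly
# (bm25_order and session_texts are assumed to be defined by the caller)
query = "what should I cook for dinner tonight?"
engine.rank_sessions_by_episode_score(query, top_k=10)
engine.query(query, method="hybrid", top_k=50)
engine.rerank_sessions_by_structure(query, bm25_order, session_texts)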

Open Source

Built in the open.
Free to use and extend.

MIT licensed. No cloud dependency. No vendor lock-in. Fork it, embed it, ship it.


View on GitHub · Star the repo · Open an issue · Follow on X: @cponsart

INSTALL

pip install contextfit

REQUIRES

Python 3.10+ · CPU only · ~41MB deps · No DB

LICENSE

MIT — use freely in commercial and open-source projects

CONTRIBUTE

Issues, PRs, and benchmark contributions welcome

Agent memory that thinks in tokens, not vectors
