Benchmarks

PLUR is measured against two questions: does its retrieval actually surface the right memory, and does it help an agent answer correctly when local knowledge matters?

The short answers, with each number’s source stated:

Retrieval recall — LongMemEval, R@5 at chunk granularity: 97.6% with hybrid search + the local cross-encoder reranker; 97.0% hybrid with cloud embeddings (openai-3-large); 92.2% BM25-only. (Published run — README and plur.ai/benchmark.html.)
In-repo cadence harness — LongMemEval Hit@5, n = 30 sanity subset: 90.0% with the reranker on, 76.7% with it off. (benchmark/run.ts, the suite every PR is compared against.)
Agent task impact — same task with and without PLUR memory: 89% win rate (31 wins / 4 losses in decisive contests, LLM-judged); house rules 12–0 across Haiku, Sonnet, and Opus.

Retrieval recall (finding the right memory) and end-to-end answer accuracy (whether the model then answers correctly) are different axes. PLUR measures and reports them separately and never conflates them.

The test setup

Two suites:

Local knowledge A/B bench (datacore-bench) — the same task given to an agent with and without PLUR memory, scored by a pairwise LLM judge. Scenarios test whether the agent can apply project-specific conventions (“our deploys go through Caddy”, “use this internal helper”). Ties are excluded from the win-rate denominator.
LongMemEval — a published academic retrieval benchmark. Questions span 6 categories: single-session facts, preferences, multi-session reasoning, temporal reasoning, knowledge updates, assistant facts.

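By default, PLUR runs fully local: BM25 + BGE embeddings + Reciprocal Rank Fusion, optionally followed by a local cross-encoder rerank pass. No API calls, no network round-trips. The openai-3-large stack reported below is an optional cloud-embedding configuration, not the default.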
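
The fusion step can be illustrated with a minimal TypeScript sketch of Reciprocal Rank Fusion. This is not PLUR's implementation: the function name `fuseRrf`, the engram IDs, and the constant k = 60 (a commonly used default for RRF) are assumptions made for illustration.

```typescript
// Minimal sketch of Reciprocal Rank Fusion (RRF); not PLUR's actual code.
// Each input list is ordered best-first. An item's fused score is the sum of
// 1 / (k + rank) over every list that contains it, with rank starting at 1.
function fuseRrf(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Sort by fused score, highest first; break ties by ID for determinism.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([id]) => id);
}

// Hypothetical engram IDs: one ranking from BM25, one from embeddings.
const bm25Ranking = ["e1", "e2", "e3"];
const embeddingRanking = ["e3", "e1", "e4"];
const fusedRanking = fuseRrf([bm25Ranking, embeddingRanking]);
console.log(fusedRanking); // [ 'e1', 'e3', 'e2', 'e4' ]
```

Items that both retrievers rank highly (`e1`, `e3`) rise above items only one retriever found, which is the property that makes fusion useful before any rerank pass.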

Two LongMemEval numbers appear on this page and they are not the same measurement:

Number	Metric	Corpus	Where it comes from
97.6%	R@5, chunk granularity	Full published run	README / plur.ai benchmark page
90.0%	Hit@5, engram fixtures	n = 30 sanity subset	In-repo `benchmark/run.ts` cadence harness

The 30-question harness exists to catch regressions per PR cheaply; the chunk-granularity run is the published retrieval-recall figure. Don’t compare either against another vendor’s differently-defined score.
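
To make the difference between the two metric names concrete, here is a sketch of Hit@k and Recall@k under their common textbook definitions. The exact definitions used by the published run and by `benchmark/run.ts` are not spelled out on this page, so the functions below (`hitAtK`, `recallAtK`) and the sample data are assumptions, not a description of either harness.

```typescript
// Sketch of two common retrieval metrics. These definitions are assumptions;
// benchmarks differ in how they define "recall" at a cutoff.
interface RetrievalCase {
  retrieved: string[]; // ranked result IDs, best first
  relevant: string[]; // gold IDs for the question (assumed non-empty)
}

// Hit@k: share of questions with at least one relevant ID in the top k.
function hitAtK(cases: RetrievalCase[], k: number): number {
  const hits = cases.filter((c) =>
    c.retrieved.slice(0, k).some((id) => c.relevant.includes(id)),
  ).length;
  return hits / cases.length;
}

// Recall@k: per question, the share of relevant IDs found in the top k,
// averaged over all questions.
function recallAtK(cases: RetrievalCase[], k: number): number {
  const total = cases.reduce((sum, c) => {
    const topK = new Set(c.retrieved.slice(0, k));
    const found = c.relevant.filter((id) => topK.has(id)).length;
    return sum + found / c.relevant.length;
  }, 0);
  return total / cases.length;
}

// Hypothetical data: the first question finds one of two gold IDs,
// the second finds none.
const sampleCases: RetrievalCase[] = [
  { retrieved: ["a", "b", "c"], relevant: ["a", "z"] },
  { retrieved: ["d", "e", "f"], relevant: ["q"] },
];
console.log(hitAtK(sampleCases, 3)); // 0.5
console.log(recallAtK(sampleCases, 3)); // 0.25
```

The same retrieval output scores 50% on one metric and 25% on the other, which is why scores defined differently should not be compared directly.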

Retrieval recall — LongMemEval

Headline stacks (R@5, chunk granularity):

Stack	R@5	Notes
PLUR hybrid + reranker	97.6%	fully local cross-encoder — no API
PLUR hybrid (openai-3-large embeddings)	97.0%	optional cloud embedder
PLUR BM25 only	92.2%	no embedder — fully airgapped

Per-category recall (hybrid + openai-3-large embeddings, chunk granularity):

Category	R@5	R@10
single-session-assistant	100.0%	100.0%
knowledge-update	98.7%	100.0%
single-session-user	98.6%	100.0%
multi-session	97.7%	99.2%
temporal-reasoning	94.0%	97.0%
single-session-preference	93.3%	96.7%

The reranker

Since 0.11.0, PLUR ships an optional cross-encoder rerank pass on top of RRF fusion. It is opt-in via the PLUR_RERANKER environment variable (default: off). Two tiers:

Tier	Model	Size	Latency (CPU)
Quality	`bge-reranker-v2-m3`	568M, multilingual	seconds per query
Tiny (shipped 0.11.0)	`ms-marco-minilm-l6`	22.7M, English	tens of ms per query

On the in-repo cadence harness (n = 30), the reranker moves Hit@5 from 76.7% to 90.0%, and temporal-reasoning R@5 from 60% to 100% — temporal questions are where fusion order alone most often puts the right engram just below the cutoff.

The catch: cross-encoders can be net-negative on out-of-domain stores. That’s why the reranker isn’t on by default, and why there’s a per-store gate: plur rerank-eval samples your store’s own engrams, synthesizes probe queries, and compares reranked ordering against plain RRF — reporting ΔMRR, Hit@1 movement, and a verdict (helpful / harmful / insufficient-data). The verdict is cached per store and surfaced by plur_doctor. It is advisory — a harmful verdict never silently disables reranking, but the command exits non-zero so scripts can gate on it.
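
The comparison that `plur rerank-eval` reports can be sketched as follows. This is an illustration, not the command's source: the `ProbeResult` shape, the `minProbes` and `epsilon` thresholds, and the choice to map a near-zero ΔMRR to `insufficient-data` are all assumptions.

```typescript
// Sketch of a per-store rerank evaluation; not the actual rerank-eval code.
// Each probe records the 1-based rank of its target engram under plain RRF
// and after reranking (null means the target was not retrieved).
interface ProbeResult {
  rrfRank: number | null;
  rerankedRank: number | null;
}

type Verdict = "helpful" | "harmful" | "insufficient-data";

interface RerankReport {
  deltaMrr: number;
  deltaHitAt1: number;
  verdict: Verdict;
}

function meanReciprocalRank(ranks: (number | null)[]): number {
  const total = ranks.reduce<number>((sum, r) => sum + (r === null ? 0 : 1 / r), 0);
  return total / ranks.length;
}

function hitAtOne(ranks: (number | null)[]): number {
  return ranks.filter((r) => r === 1).length / ranks.length;
}

// minProbes and epsilon are hypothetical thresholds chosen for illustration.
function evaluateRerank(probes: ProbeResult[], minProbes = 20, epsilon = 0.01): RerankReport {
  if (probes.length < minProbes) {
    return { deltaMrr: 0, deltaHitAt1: 0, verdict: "insufficient-data" };
  }
  const baseline = probes.map((p) => p.rrfRank);
  const reranked = probes.map((p) => p.rerankedRank);
  const deltaMrr = meanReciprocalRank(reranked) - meanReciprocalRank(baseline);
  const deltaHitAt1 = hitAtOne(reranked) - hitAtOne(baseline);
  let verdict: Verdict = "insufficient-data"; // no clear signal either way
  if (deltaMrr > epsilon) verdict = "helpful";
  else if (deltaMrr < -epsilon) verdict = "harmful";
  return { deltaMrr, deltaHitAt1, verdict };
}

// Hypothetical probes; minProbes is lowered to 4 so the tiny sample is scored.
const demoProbes: ProbeResult[] = [
  { rrfRank: 2, rerankedRank: 1 },
  { rrfRank: 1, rerankedRank: 1 },
  { rrfRank: 4, rerankedRank: 2 },
  { rrfRank: null, rerankedRank: 5 },
];
const demoReport = evaluateRerank(demoProbes, 4);
console.log(demoReport.verdict, demoReport.deltaMrr.toFixed(4), demoReport.deltaHitAt1);
```

A wrapper script that wants to gate on the result could set a non-zero exit code when the verdict is `harmful`, mirroring the behavior described above.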

Latency: with the quality-tier reranker, the v0.9.13 cadence run measured p50 ≈ 3s and p95 ≈ 10s per recall with ~2 GB peak RSS — acceptable for agentic use, not for latency-sensitive interactive paths. The tiny tier exists precisely for the latter.
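
For readers reproducing the latency figures, p50 and p95 can be derived from per-recall timings as sketched below. The nearest-rank method and the sample timings are assumptions; this page does not state which percentile method the cadence run used.

```typescript
// Nearest-rank percentile over a list of timings in milliseconds.
// This is one common percentile definition, assumed here for illustration.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length)); // 1-based
  return sorted[rank - 1];
}

// Hypothetical per-recall timings, not measurements from any PLUR run.
const recallLatenciesMs = [2100, 2800, 3000, 3100, 3300, 4200, 5100, 6400, 8800, 10200];
console.log(percentile(recallLatenciesMs, 50)); // 3300
console.log(percentile(recallLatenciesMs, 95)); // 10200
```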

Local knowledge — the A/B bench

Same task, same model, with and without PLUR memory, scored by a pairwise LLM judge:

89% win rate in decisive contests (31 wins, 4 losses; ties excluded).
House rules 12–0: on tasks that hinge on project conventions, memory won every contest across Haiku, Sonnet, and Opus. Without memory the agent guesses; with memory it knows. This is the canonical case for installing PLUR.
Haiku with PLUR memory outperformed Opus without it on tool-routing and local-knowledge tasks — 2.6× better on tool routing, at roughly 10× lower cost. The bottleneck isn’t model intelligence; it’s context.
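
The win-rate arithmetic is simple enough to check by hand. A small sketch, where the `AbTally` type and `decisiveWinRate` function are illustrative names and the tie count is a placeholder because it is not given on this page:

```typescript
// Win rate over decisive contests only: ties are reported separately and
// excluded from the denominator.
interface AbTally {
  wins: number;
  losses: number;
  ties: number;
}

function decisiveWinRate(tally: AbTally): number {
  const decisive = tally.wins + tally.losses;
  if (decisive === 0) return NaN; // no decisive contests, rate is undefined
  return tally.wins / decisive;
}

// 31 wins and 4 losses are the published counts. The tie count is not stated
// here, so 0 is a placeholder; it does not affect the result.
const publishedTally: AbTally = { wins: 31, losses: 4, ties: 0 };
const winRatePercent = (decisiveWinRate(publishedTally) * 100).toFixed(1);
console.log(winRatePercent + "%"); // 88.6%, reported as 89%
```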

Methodology

Hardware: standard developer laptops (M-series Mac, no GPU required).
Embedder: bge-small-en-v1.5 (~50 MB, 384-dim), local.
Search: BM25 + embeddings + RRF; optional cross-encoder rerank pass (PLUR_RERANKER).
Cadence standard: npx tsx benchmark/run.ts --rerank on — hybrid + reranker is the baseline every PR is compared against.
Judges (A/B bench): pairwise LLM judge; ties reported separately and excluded from the win rate.

An earlier baseline (86.7% overall / 93.3% Hit@10, v0.2.1) circulated in older versions of this page. It survives only as the self-calibration target in the repo’s phase-2 methodology notes — it is not a current score and shouldn’t be cited as one.

Confidence levels — what we claim and don’t claim

Claim	Confidence	Caveat
97.6% R@5 (chunk granularity)	High	Published run; full methodology on plur.ai. Retrieval recall, not answer accuracy.
90.0% Hit@5 (n = 30)	Medium	Sanity subset on fixture corpus — a regression tripwire, not a headline.
89% A/B win rate	Medium-high	35 decisive contests. Measures Datacore + PLUR together (see below).
12–0 on house rules	High	Small n (12) but unambiguous, across 3 models.
Reranker +13.3 pts Hit@5	Medium	n = 30. Directionally consistent with the chunk-granularity gap (97.6 vs 92.2 BM25-only).
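
One way to see why the n = 30 rows are rated Medium is to convert the percentages back into question counts. The sketch below assumes each question is scored as a plain hit or miss; `hitsFromRate` is an illustrative helper, not part of the harness.

```typescript
// Convert a reported percentage back to a question count, assuming each of
// the n questions is scored as a plain hit or miss (an assumption).
function hitsFromRate(ratePercent: number, n: number): number {
  return Math.round((ratePercent / 100) * n);
}

const subsetSize = 30;
const hitsRerankOn = hitsFromRate(90.0, subsetSize); // 27 of 30
const hitsRerankOff = hitsFromRate(76.7, subsetSize); // 23 of 30
const questionsGained = hitsRerankOn - hitsRerankOff; // 4 questions
const pointsPerQuestion = 100 / subsetSize; // about 3.33 points each

console.log(hitsRerankOn, hitsRerankOff, questionsGained, pointsPerQuestion.toFixed(2));
```

Under that assumption the +13.3 point gain corresponds to 4 questions, and a single question flipping moves the score by about 3.3 points.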

What we explicitly do not claim

PLUR is not a search engine. It indexes engrams, not arbitrary corpora. If you want to search PDFs, you want a different tool.
PLUR doesn’t make small models as smart as big ones. It makes small models correct about your specifics — a different axis from raw reasoning.
Retrieval recall is not answer accuracy. R@5 says the right memory was surfaced; whether the model then answers correctly is a separate measurement. An end-to-end answer-accuracy run (n = 500) is in progress, along with LoCoMo and agentic task suites.
The A/B bench measures Datacore + PLUR together. Datacore includes CLAUDE.md context and module loading alongside PLUR memory. We mention this every time we cite the 89% number.
Memory adds latency. Injection and MCP init add per-session overhead, and the quality-tier reranker adds seconds per recall. The overhead is too small to measure across a long session but visible on 1-turn smoke tests.

Try it yourself

The in-repo harness lives in benchmark/:

# Standard cadence run (hybrid + reranker — the current baseline)
npx tsx benchmark/run.ts --rerank on

# Reranker-off comparison
npx tsx benchmark/run.ts --rerank off

# Single-category drill-down
npx tsx benchmark/run.ts --rerank on --category temporal_reasoning

Local runs write to benchmark/results/ (gitignored); archived result JSONs live in the plur-ai/plur-bench repo. See benchmark/README.md for corpus options and micro-benchmark instructions. Open an issue if a number doesn’t reproduce — that’s a bug.