Benchmarks
PLUR is measured against two questions: does it help an agent answer correctly when local knowledge matters, and does its retrieval actually surface the right memory?
The short answers:
- Local knowledge win rate: 89% of decisive head-to-head contests across Haiku 4.5, Sonnet 4.6, and Opus 4.5.
- LongMemEval (n = 30): 86.7% overall, 93.3% Hit@10 — zero API calls, zero cost.
- House-rules subset: 12 wins, 0 losses, 0 ties across every Claude model tested.
A full 500-question reproducible LongMemEval run is specified in plur-ai/plur#46; the 30-question sample numbers here will be re-cited when that lands.
The test setup
We give the same task to Claude, with and without PLUR memory. Two suites:
- Local knowledge benchmark (datacore-bench): 28 scenarios that test whether the agent can apply project-specific conventions (“our deploys go through Caddy”, “use this internal helper”). An LLM judge decides the winner. Ties (no clear winner either way) are reported separately and excluded from the win-rate denominator.
- LongMemEval: a published academic retrieval benchmark, with 30 questions sampled across 6 categories: single-session facts, preferences, multi-session reasoning, temporal reasoning, knowledge updates, assistant facts.
PLUR runs in hybrid search mode (BM25 + local BGE embeddings + RRF fusion). No API calls, no network round-trips.
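To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It illustrates the general technique, not PLUR's actual implementation: the function name `rrf_fuse`, the constant `k=60` (a conventional default), and the engram ids are all assumptions made for this example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical engram ids, as ranked by each retriever.
bm25_ranking = ["caddy-deploy", "helper-fn", "db-host"]
embedding_ranking = ["caddy-deploy", "api-quirk", "helper-fn"]

fused = rrf_fuse([bm25_ranking, embedding_ranking])
print(fused)
```

Because RRF only looks at ranks, it needs no score normalisation between BM25 and embedding similarity, which is one reason it suits a setup with no rerank pass.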
Local knowledge — 89% win rate
28 scenarios × 3 models = 84 contests. 22 were inconclusive (both responses adequate). Of the 57 decisive contests, PLUR wins 89%.
| Knowledge type | PLUR wins | Vanilla wins | Win rate |
|---|---|---|---|
| House rules (project conventions) | 12 | 0 | 100% |
| Tool routing (which of N internal tools to use) | 9 | 1 | 90% |
| Server / infrastructure recall | 8 | 1 | 89% |
| Domain knowledge (specialised vocab) | 7 | 1 | 88% |
| API quirks / known bugs | 5 | 1 | 83% |
The house-rules row is the cleanest signal: 12 wins, 0 losses, 0 ties. Without memory, the agent guesses; with memory, the agent knows. This is the canonical case for installing PLUR.
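The per-row win rates follow from dividing PLUR wins by decisive contests, with ties left out of the denominator. A small sketch of that arithmetic, using the win and loss counts from the table (the function name `win_rate` is only illustrative):

```python
def win_rate(wins, losses):
    """Win rate over decisive contests only; ties are excluded."""
    decisive = wins + losses
    if decisive == 0:
        raise ValueError("no decisive contests")
    return wins / decisive


# (PLUR wins, vanilla wins) per knowledge type, copied from the table.
rows = {
    "House rules": (12, 0),
    "Tool routing": (9, 1),
    "Server / infrastructure recall": (8, 1),
    "Domain knowledge": (7, 1),
    "API quirks / known bugs": (5, 1),
}

for name, (wins, losses) in rows.items():
    print(f"{name}: {win_rate(wins, losses):.1%}")
```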
Per-model breakdown
PLUR helps every model, but for different reasons. Cheaper models can’t explore — memory gives them navigation. Expensive models can explore — memory gives them things they can’t discover.
| Model | Win rate (decisive) | Cost per run | Notes |
|---|---|---|---|
| Haiku 4.5 + PLUR | 92% | ~$1 | Largest delta — Haiku without memory often takes wrong tool branches. |
| Sonnet 4.6 + PLUR | 87% | ~$3 | Reliable middle ground; biggest gain on tool routing. |
| Opus 4.5 + PLUR | 88% | ~$10 | Smallest delta in absolute terms, but Opus without memory still got house rules wrong 60% of the time. |
The headline that surprises people: Haiku with PLUR memory outperforms Opus without it, at ~10× lower cost per run (~$1 vs ~$10). On discoverability tasks, Haiku 4.5 averages 0.80 with PLUR; Opus 4.5 averages 0.31 without. The bottleneck isn’t model intelligence; it’s context.
LongMemEval — 86.7% overall
LongMemEval tests whether a memory system can correctly answer questions about past conversations across six categories:
| Category | PLUR (n = 30) |
|---|---|
| Single-session facts | High |
| User preferences | High |
| Multi-session reasoning | Mid |
| Temporal reasoning | Mid |
| Knowledge updates | 5/5 |
| Assistant-facts recall | High |
| Overall | 86.7% |
| Hit@10 | 93.3% |
Hit@10 of 93.3% is the key technical number: the right memory is in the top 10 results almost every time. The 6.6-percentage-point gap between Hit@10 and overall accuracy is entirely in how the answering model synthesises the retrieved memories.
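For readers unfamiliar with the metric, the following sketch shows one way Hit@k can be computed. The toy data and the function names `hit_at_k` and `hit_rate` are hypothetical. The last lines convert the two reported percentages back into question counts, assuming both are measured over the same 30 questions.

```python
def hit_at_k(retrieved_ids, relevant_ids, k=10):
    """True if any relevant id appears among the top-k retrieved ids."""
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])


def hit_rate(results, k=10):
    """Fraction of questions with at least one relevant memory in the top k."""
    hits = sum(hit_at_k(retrieved, relevant, k) for retrieved, relevant in results)
    return hits / len(results)


# Hypothetical toy data: (ranked retrieved ids, set of relevant ids) per question.
results = [
    (["m3", "m7", "m1"], {"m7"}),  # hit at rank 2
    (["m2", "m9", "m4"], {"m5"}),  # miss
    (["m5", "m8", "m6"], {"m5"}),  # hit at rank 1
]
rate = hit_rate(results, k=10)

# Assumption: both reported figures are over the same 30 questions.
n = 30
retrieved_ok = round(0.933 * n)  # questions with the right memory in the top 10
answered_ok = round(0.867 * n)   # questions answered correctly
gap_questions = retrieved_ok - answered_ok
print(rate, retrieved_ok, answered_ok, gap_questions)
```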
Competitive context
| System | LongMemEval (Opus answering) | Hit@10 | Notes |
|---|---|---|---|
| PLUR | 86.7% | 93.3% | Local. Zero API cost. |
| Mastra Cloud | 95% | — | Hosted; sends your data to their cloud. |
| Supermemory | 85.2% | — | Hosted. |
| Letta (MemGPT) | ~75% | — | Hosted. |
| Vanilla RAG baseline | ~30% | — | No agentic retrieval. |
The pattern: systems that ship your data to a cloud can score higher because they can run heavier rerank passes server-side. PLUR’s design trade-off is data sovereignty plus zero per-query cost, accepting that the absolute ceiling is a few points lower than the cloud leaders.
The 56-point gap
- PLUR + Opus 4.5 answering: 86.7%
- PLUR + GPT-4o answering: ~30%
Same retrieval. Same context. Different answering model. The retrieved memory is in both prompts; one model uses it, the other doesn’t. PLUR provides retrieval; the model provides reasoning. This is why we measure with multiple answering models and report the spread.
What we learned building this
Three insights that fed back into PLUR’s design:
The enriched schema is everything. Making BM25 and embeddings see entity names, temporal dates, and rationale text delivered +43 percentage points on LongMemEval versus indexing raw statements only. Generic search engines index text; PLUR indexes knowledge-enriched text. This is the PLUR-specific advantage.
Retrieval is not the bottleneck — answering is. At 93% Hit@10, the right memory is almost always in the top 10. The 56-point spread between models on the same context says: if your agent isn’t getting the right answer, recall isn’t the first thing to debug.
The cheapest model with memory beats the most expensive without it. Haiku 4.5 + PLUR at ~$1/run on discoverability tasks (avg 0.80) outperforms Opus 4.5 at ~$10/run with no memory (avg 0.31). Instead of spending 10× on a bigger model, spend ~0.1× on memory.
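As an illustration of the first insight, the sketch below shows what "knowledge-enriched text" could look like before it is handed to BM25 and the embedder. The `Engram` fields and the `index_text` helper are hypothetical and chosen only for this example; PLUR's real schema may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Engram:
    # Hypothetical fields, for illustration only.
    statement: str
    entities: list[str] = field(default_factory=list)
    date: str = ""
    rationale: str = ""


def index_text(engram: Engram) -> str:
    """Build the text that both BM25 and the embedder would see."""
    parts = [engram.statement]
    if engram.entities:
        parts.append("Entities: " + ", ".join(engram.entities))
    if engram.date:
        parts.append("Date: " + engram.date)
    if engram.rationale:
        parts.append("Rationale: " + engram.rationale)
    return "\n".join(parts)


example = Engram(
    statement="Deploys go through Caddy.",
    entities=["Caddy"],
    date="2025-01-15",
    rationale="A single reverse proxy keeps TLS config in one place.",
)
print(index_text(example))
```

Indexing only `statement` corresponds to the "raw statements only" baseline; the extra lines give keyword and embedding search more to match on.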
Methodology
- Hardware: standard developer laptops (M-series Mac, no GPU required for inference).
- Embedder: BGE-small-en-v1.5, local — runs in ~50 ms on CPU.
- Search: BM25 + embeddings + RRF, no rerank pass.
- Answering model (LongMemEval): Anthropic Claude Opus 4.5 unless otherwise stated. We re-ran with GPT-4o, Sonnet, and Haiku for the cross-model table.
- Judges (datacore-bench): pairwise LLM judge (Opus 4.5), seeded random order to avoid position bias.
- Reruns: each scenario × model executed 3× with different temperatures; reported numbers are medians.
Full raw data is in benchmark/ inside the PLUR monorepo.
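The judge ordering and rerun aggregation described above can be sketched as follows. This is a minimal illustration under assumed names (`presentation_order`, `aggregate`, and the labels `"plur"` and `"vanilla"`), not the code in the monorepo.

```python
import random
import statistics


def presentation_order(scenario_id, seed=0):
    """Pick which response the judge sees first, reproducibly per scenario."""
    # A string seed makes the shuffle deterministic for a given seed and scenario.
    rng = random.Random(f"{seed}:{scenario_id}")
    pair = ["plur", "vanilla"]  # hypothetical labels for the two responses
    rng.shuffle(pair)
    return pair


def aggregate(scores):
    """Report the median score across reruns of one scenario and model."""
    return statistics.median(scores)


order = presentation_order("house-rules-01")
median_score = aggregate([0.7, 0.9, 0.8])  # three hypothetical rerun scores
print(order, median_score)
```

Seeding per scenario keeps the order reproducible across reruns while still varying it across scenarios, so neither side is systematically shown first.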
Confidence levels — what we claim and don’t claim
We try to be precise about what these numbers mean.
| Claim | Confidence | Caveat |
|---|---|---|
| 89% local-knowledge win rate | High | n = 57 decisive contests, 3 models. Stable across runs. |
| 12–0 on house rules | High | Across all 3 Claude models. Small n (12) but unambiguous result. |
| 86.7% LongMemEval | Medium | 30-question sample. True score likely 80–93%. Need a 500-question run. |
| Ahead of Supermemory | Low–medium | Supermemory may use the full 500 questions. Sample sizes differ. |
| Cost reframe (Haiku > Opus) | Medium-high | Holds on discoverability tasks. May not generalise to pure reasoning tasks. |
| Zero penalty on general tasks | Medium | Cold task scores are identical, but memory adds ~2× latency from MCP init + injection. |
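To see why sample size drives the LongMemEval caveat, one option is to compare binomial confidence intervals at different n. The sketch below uses the Wilson score interval at a 95% level, which is one common choice and an assumption here; the ranges quoted in the table are the authors' own estimates and may rest on different assumptions. The n = 500 line simply reuses the same observed proportion and is not a result.

```python
import math


def wilson_interval(p, n, z=1.96):
    """Wilson score interval for an observed proportion p over n trials."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


observed = 0.867  # reported overall accuracy on the 30-question sample

low_30, high_30 = wilson_interval(observed, 30)
low_500, high_500 = wilson_interval(observed, 500)  # hypothetical: same proportion at n = 500

print(f"n=30:  width {high_30 - low_30:.3f}")
print(f"n=500: width {high_500 - low_500:.3f}")
```

Whatever interval method is used, the width shrinks roughly with the square root of n, which is the practical argument for the 500-question run.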
What we explicitly do not claim
- PLUR is not a search engine. It indexes engrams, not arbitrary corpora. If you want to search PDFs, you want a different tool.
- PLUR doesn’t make small models as smart as big ones. It makes small models correct about your specifics — that’s a different axis from raw reasoning.
- 86.7% LongMemEval is not a 500-question score. It’s a 30-question sample. The full run is on the roadmap.
- The A/B bench measures Datacore + PLUR together. Datacore includes CLAUDE.md context and module loading alongside PLUR memory. We mention this every time we cite the 89% number.
- Memory adds latency. Cold-start MCP init plus engram injection adds ~150–300 ms per session. On 30-message sessions this is unmeasurable; on 1-turn smoke tests it’s ~2× the baseline.
Try it yourself
The benchmark code is open. From a memorybench checkout:
    PLUR_SEARCH_MODE=hybrid python run.py --provider plur
Search modes:
| Mode | What it does |
|---|---|
| fast | BM25 only. Instant. |
| hybrid | BM25 + embeddings + RRF (default; the 86.7% number). |
| agentic | LLM rerank. Slower; rarely worth it for small corpora. |
| expanded | Query expansion + hybrid. Wins on multi-session questions. |
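To compare modes side by side, a small driver can prepare one run per mode. This sketch only builds the command and environment for each run. It assumes the other three modes are selected through the same `PLUR_SEARCH_MODE` variable as `hybrid`, which the command above suggests but does not confirm; `build_run` is a hypothetical helper.

```python
import os

SEARCH_MODES = ("fast", "hybrid", "agentic", "expanded")


def build_run(mode):
    """Return (argv, env) for one benchmark run in the given search mode."""
    if mode not in SEARCH_MODES:
        raise ValueError(f"unknown search mode: {mode!r}")
    # Copy the current environment and set the mode for this run only.
    env = dict(os.environ, PLUR_SEARCH_MODE=mode)
    argv = ["python", "run.py", "--provider", "plur"]
    return argv, env


runs = {mode: build_run(mode) for mode in SEARCH_MODES}
for mode, (argv, env) in runs.items():
    print(f"PLUR_SEARCH_MODE={env['PLUR_SEARCH_MODE']}", " ".join(argv))
```

To actually execute a run from a memorybench checkout, the pair can be passed to `subprocess.run(argv, env=env, check=True)`.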
For the local-knowledge suite, see datacore-bench/scenarios/ — 28 scenarios you can run against your own setup.
Reproducibility
Every benchmark on this page is rerunnable from the PLUR monorepo. The data, scripts, judge prompts, and result dumps are committed. Open an issue if a number doesn’t reproduce; that’s a bug.