# pgmnemo Benchmarks
**Status:** v0.2.1, first honest results, retrieval-only mode.

This document summarizes our public benchmark results, our methodology, and an honest comparison against published baselines.
## TL;DR
| Benchmark | pgmnemo v0.2.1 | Notable comparison |
|---|---|---|
| LoCoMo retrieval (DRAGON, n=1982) | recall@10 = 0.795, MRR = 0.548 (session-level) | Within paper-class range (DRAGON canonical, session granularity) |
| LongMemEval retrieval (bge-m3, n=500, s_cleaned) | recall@10 = 0.933, MRR = 0.855 | Below the in-repo BM25 baseline (0.982) |
Reports, `raw_retrievals`, and reproduction commands:

- `benchmarks/locomo/results/v0.2.1_20260509/`
- `benchmarks/longmemeval/results/v0.2.1_20260509/` — BM25 baseline (`run_nollm.py`)
- `benchmarks/longmemeval/results/v0.2.1_pgmnemo_20260509/` — pgmnemo vector (`run_longmemeval_pgmnemo.py`)
## Methodology Conformance
### LoCoMo (Maharana et al., ACL 2024)
| Paper requirement | Our implementation | Status |
|---|---|---|
| Dataset | snap-research/locomo10.json (10 conversations, 1986 questions, 5 categories) | ✅ verbatim |
| Embedder | facebook/dragon-plus-{context,query}-encoder | ✅ paper canonical |
| Retrieval k | k ∈ {5, 10, 25, 50} | ✅ all reported |
| Metric (primary retrieval) | recall@K | ✅ |
| MRR (secondary) | yes | ✅ |
| LLM-as-judge accuracy (downstream eval) | n/a — retrieval-only mode | ⚠️ deferred |
| Storage dim | 768d (DRAGON native) | ⚠️ DEVIATION: pgmnemo enforces vector(1024); we zero-pad 768→1024. Cosine similarity preserved (math-identical). See ADDENDA/LOCOMO_EMBEDDER_PADDING.md. |
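The padding deviation is easy to sanity-check. Below is a minimal numeric demonstration (NumPy assumed; the vectors are random stand-ins for DRAGON embeddings) of why trailing zeros leave cosine similarity unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=768)  # stand-in for a DRAGON query embedding
d = rng.normal(size=768)  # stand-in for a DRAGON context embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero-pad 768d -> 1024d, as pgmnemo's vector(1024) column requires.
def pad(v, dim=1024):
    return np.concatenate([v, np.zeros(dim - v.shape[0])])

# Trailing zeros change neither the dot product nor the norms,
# so cosine similarity is identical up to float rounding.
assert np.isclose(cosine(q, d), cosine(pad(q), pad(d)))
```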
### LongMemEval (Wu et al., ICLR 2025)
| Paper requirement | Our implementation | Status |
|---|---|---|
| Dataset | xiaowu0162/longmemeval-cleaned (longmemeval_s_cleaned.json, 500 questions × ~47.7 sessions/haystack) | ✅ |
| Embedder | NovaSearch/stella_en_1.5B_v5 (1024d) | ⚠️ DEVIATION: bundled modeling_qwen.py is incompatible with transformers 5.8 (Qwen2Config.rope_theta AttributeError); substituted BAAI/bge-m3 (1024d, strong on MTEB, common in production; loading sketch below the table). See ADDENDA/LONGMEMEVAL_EMBEDDER_BGE_M3.md. |
| Retrieval metrics (recall@K, NDCG@K, MRR) | recall@{1,5,10,20} + MRR | ⚠️ NDCG@K not reported |
| Question types | 5 (single-session-{user,assistant,preference}, multi-session, temporal-reasoning, knowledge-update + abstention variant) | ✅ |
| LLM-as-judge accuracy via evaluate_qa.py | n/a — retrieval-only mode | ⚠️ deferred (no API key; paper supports retrieval-only) |
| Session truncation | 500 chars per session (config bug, not hardware limit) | ✅ no significant impact: QUICK-C re-run (v0.2.1_pgmnemo_20260509) recall@10 delta = 0.0008; addendum withdrawn. |
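For concreteness, this is roughly how the substituted embedder is loaded. A sketch only: it assumes BAAI/bge-m3 loads through sentence-transformers (the model ships a sentence-transformers config), and the exact flags in run_longmemeval_pgmnemo.py may differ. The sessions here are toy strings for illustration:

```python
from sentence_transformers import SentenceTransformer

# BAAI/bge-m3 emits 1024d dense embeddings, matching pgmnemo's
# vector(1024) column directly (no padding, unlike DRAGON's 768d).
model = SentenceTransformer("BAAI/bge-m3")

sessions = [  # toy sessions, invented for illustration
    "user: remind me which hotel I booked in Kyoto ...",
    "assistant: you booked the Grand Prince Kyoto ...",
]
emb = model.encode(sessions, normalize_embeddings=True)
assert emb.shape == (2, 1024)
```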
## Honest Findings
### 1. BM25 baseline outperforms pgmnemo vector on LongMemEval
| Metric | pgmnemo vector (bge-m3) | BM25 baseline |
|---|---|---|
| recall@10 | 0.933 | 0.982 |
| recall@20 | 0.977 | 0.996 |

Both runs use the same dataset (longmemeval_s_cleaned, n=500). BM25 wins at both cutoffs.
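For reference, the headline metrics (and the Wilson score intervals we use for CIs; see References) can be recomputed from the raw_retrievals dumps along these lines. The names here are invented for illustration; the benchmark scripts have their own internals:

```python
import math

def recall_at_k(ranked, gold, k=10):
    """Fraction of questions with at least one gold session in the top k.
    ranked: {question_id: [session_id, ...]}, gold: {question_id: {session_id}}."""
    hits = sum(any(s in gold[q] for s in ranked[q][:k]) for q in ranked)
    return hits / len(ranked)

def mrr(ranked, gold):
    """Mean reciprocal rank of the first gold session (0 when absent)."""
    total = 0.0
    for q, sessions in ranked.items():
        for rank, s in enumerate(sessions, start=1):
            if s in gold[q]:
                total += 1.0 / rank
                break
    return total / len(ranked)

def wilson_ci(p, n, z=1.96):
    """Wilson (1927) score interval for a proportion such as recall@k."""
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# e.g. pgmnemo's recall@10 = 0.933 over n=500 questions:
print(wilson_ci(0.933, 500))  # ~(0.908, 0.952)
```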
Hypothesized causes (under WG investigation):

- LongMemEval questions share many keywords with their relevant sessions, a BM25-friendly setup (see the probe sketch below)
- pgmnemo's 5-component scoring may over-penalize short queries
- the bge-m3 substitution (vs the paper-canonical Stella V5) may explain part of the gap
- session truncation had near-zero impact (QUICK-C delta = 0.0008)
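The first hypothesis can be probed outside the harness by scoring sessions with plain BM25 and inspecting which question terms drive the ranking. A sketch assuming the rank_bm25 package and whitespace tokenization; the in-repo baseline (run_nollm.py) may tokenize differently, and the haystack here is a toy:

```python
from rank_bm25 import BM25Okapi

# Toy haystack: one document per session.
sessions = [
    "user: I adopted a beagle named Copper last spring",
    "assistant: the marathon you registered for is on October 12",
    "user: my beagle Copper chewed through the couch again",
]
question = "when is the marathon the user registered for"

bm25 = BM25Okapi([s.lower().split() for s in sessions])
scores = bm25.get_scores(question.lower().split())

# Session 1 shares "marathon", "registered", "for" with the question and
# ranks first; on LongMemEval this kind of lexical overlap is often
# enough for BM25 to beat dense retrieval.
ranking = sorted(range(len(sessions)), key=lambda i: -scores[i])
print(ranking, list(scores))
```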
### 2. pgmnemo wins on certain question types
| Q-type | pgmnemo recall@10 | Notes |
|---|---|---|
| single-session-assistant | 0.982 | tied with BM25 |
| multi-session | 0.957 | strong vs BM25-only baselines |
| temporal-reasoning | 0.933 | competitive |
| knowledge-update | 0.923 | competitive |
| single-session-preference | 0.900 | competitive |
| single-session-user | 0.871 | weakest |
### 3. LoCoMo recall@10 = 0.366 (turn-level indexing) falls below paper-reported retrievers

(The session-level figure in the TL;DR is 0.795; the granularity gap is the leading suspect.)
Likely causes (under WG investigation):

- we index turn-level segments, while the paper may use session-level (testable with the aggregation sketch below)
- the 5-component scoring weights need calibration on this dataset
- DRAGON's 768d zero-padded to 1024d may have second-order HNSW effects (theoretically none, but worth verifying)
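The granularity hypothesis is testable without re-indexing: max-pool turn-level scores up to their parent session and re-score at session granularity. A sketch with invented shapes; it assumes each retrieved turn carries its session_id:

```python
def to_session_level(turn_hits, k=10):
    """Collapse a ranked list of (turn_id, session_id, score) tuples to a
    session-level top-k by keeping each session's best-scoring turn."""
    best = {}
    for _turn_id, session_id, score in turn_hits:
        if score > best.get(session_id, float("-inf")):
            best[session_id] = score
    return sorted(best, key=best.get, reverse=True)[:k]

# Comparing recall@10 on these session-level lists against the turn-level
# 0.366 isolates how much of the LoCoMo gap is granularity rather than scoring.
```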
## Reproducibility
```bash
# Environment setup in 3 commands:

# 1) Postgres 17 + pgvector container:
docker run -d --name pgmnemo-bench -p 15432:5432 \
  -e POSTGRES_PASSWORD=bench -e POSTGRES_USER=bench -e POSTGRES_DB=bench \
  pgvector/pgvector:pg17

# 2) Build and install pgmnemo inside the container
#    (assumes the pgmnemo source tree is already at /tmp/pgmnemo,
#    e.g. copied in with `docker cp`):
docker exec pgmnemo-bench bash -c "apt-get update -qq && \
  apt-get install -y -qq postgresql-server-dev-17 build-essential && \
  cd /tmp/pgmnemo && make && make install"

# 3) Create the extension:
docker exec pgmnemo-bench psql -U bench -d bench -c "CREATE EXTENSION pgmnemo CASCADE;"
```
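Before launching the benchmark scripts it is worth confirming the extension actually installed. A sketch using psycopg 3 (our choice for illustration; any Postgres client works) against the container's published port:

```python
import psycopg  # psycopg 3

with psycopg.connect(
    "host=localhost port=15432 user=bench password=bench dbname=bench"
) as conn:
    row = conn.execute(
        "SELECT extversion FROM pg_extension WHERE extname = 'pgmnemo'"
    ).fetchone()
    assert row is not None, "pgmnemo extension is not installed"
    print("pgmnemo", row[0])  # expect 0.2.1
```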
```bash
# LoCoMo (DRAGON, ~2 min on Apple Silicon MPS):
python benchmarks/scripts/run_locomo_bench.py

# LongMemEval (bge-m3, ~16 min on Apple Silicon MPS):
python benchmarks/scripts/run_longmemeval_pgmnemo.py
```
Hardware used for the published numbers:

- Apple M-series Silicon (MPS GPU acceleration)
- Python 3.11.14, torch 2.11, transformers 5.8, sentence-transformers 5.4
- Wall clock: LoCoMo 111 s; LongMemEval 944 s
## What's Next (WG in progress)
WG goals:

1. Investigate why BM25 beats vector retrieval on LongMemEval.
2. Identify scoring-formula tuning paths to close the gap.
3. Reproduce the paper-canonical Stella V5 (transformers downgrade or an API-compat shim).
4. Compare against MAGMA (arXiv:2601.03236), Mem0, Zep, and HippoRAG on the same benchmarks.
5. Roadmap: pgmnemo v0.2.2 (calibration) → v0.3.0 (multi-graph + dim-flex).
## References
- Maharana, A. et al. (2024). "Evaluating Very Long-Term Conversational Memory of LLM-based Agents." ACL 2024. arXiv:2402.17753
- Wu, D. et al. (2025). "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory." ICLR 2025. arXiv:2410.10813
- Lin, S.-C. et al. (2023). "How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval." HF: facebook/dragon-plus
- BAAI/bge-m3: multilingual dense embedder (1024d), strong on MTEB
- Wilson, E. B. (1927). "Probable Inference, the Law of Succession, and Statistical Inference." JASA 22(158). Used for score confidence intervals.