pgmnemo Benchmarks

Status: v0.2.1, first honest results (retrieval-only mode)

This document summarizes our public benchmark results, our methodology, and an honest comparison against published baselines.


TL;DR

| Benchmark | pgmnemo v0.2.1 | Notable comparison |
| --- | --- | --- |
| LoCoMo retrieval (DRAGON, n=1982) | recall@10 = 0.795, MRR = 0.548 (session-level) | paper-class range (DRAGON canonical, session granularity) |
| LongMemEval retrieval (bge-m3, n=500, s_cleaned) | recall@10 = 0.933, MRR = 0.855 | below in-repo BM25 baseline (0.982) |

Reports, raw_retrievals, and reproduction commands:

- benchmarks/locomo/results/v0.2.1_20260509/
- benchmarks/longmemeval/results/v0.2.1_20260509/ — BM25 baseline (run_nollm.py)
- benchmarks/longmemeval/results/v0.2.1_pgmnemo_20260509/ — pgmnemo vector (run_longmemeval_pgmnemo.py)


Methodology Conformance

LoCoMo (Maharana et al., ACL 2024)

| Paper requirement | Our implementation | Status |
| --- | --- | --- |
| Dataset | snap-research/locomo10.json (10 conversations, 1986 questions, 5 categories) | ✅ verbatim |
| Embedder | facebook/dragon-plus (context+query) | ✅ paper canonical |
| Retrieval k | k ∈ {5, 10, 25, 50} | ✅ all reported |
| Metric (primary retrieval) | recall@K | |
| MRR (secondary) | yes | |
| LLM-as-judge accuracy (downstream eval) | n/a — retrieval-only mode | ⚠️ deferred |
| Storage dim | 768d (DRAGON native) | ⚠️ DEVIATION: pgmnemo enforces vector(1024); we zero-pad 768→1024. Cosine similarity is preserved (mathematically identical). See ADDENDA/LOCOMO_EMBEDDER_PADDING.md. |
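Why the zero-padding deviation is math-identical, as a minimal sketch (illustrative code, not from the repo): the appended zeros contribute nothing to the dot product or to either norm, so cosine similarity over the padded 1024-d vectors equals cosine similarity over the original 768-d vectors.

```python
# Sketch: right-pad a 768-d DRAGON embedding with zeros to fit pgmnemo's
# vector(1024) column; cosine similarity is unchanged.
import math
import random

def pad(v: list[float], dim: int = 1024) -> list[float]:
    """Right-pad with zeros up to `dim` (pgmnemo enforces vector(1024))."""
    return v + [0.0] * (dim - len(v))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

rng = random.Random(0)
q = [rng.gauss(0, 1) for _ in range(768)]
d = [rng.gauss(0, 1) for _ in range(768)]
assert abs(cosine(q, d) - cosine(pad(q), pad(d))) < 1e-12
```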

LongMemEval (Wu et al., ICLR 2025)

| Paper requirement | Our implementation | Status |
| --- | --- | --- |
| Dataset | xiaowu0162/longmemeval-cleaned (longmemeval_s_cleaned.json, 500 questions × ~47.7 sessions/haystack) | |
| Embedder NovaSearch/stella_en_1.5B_v5 (1024d) | BAAI/bge-m3 (1024d) | ⚠️ DEVIATION: the bundled modeling_qwen.py is incompatible with transformers 5.8 (Qwen2Config.rope_theta AttributeError); we substituted BAAI/bge-m3 (1024d, MTEB-strong, matches common production use). See ADDENDA/LONGMEMEVAL_EMBEDDER_BGE_M3.md. |
| Retrieval (recall@K, NDCG@K, MRR) | recall@{1,5,10,20} + MRR | |
| Question types | 5 (single-session-{user,assistant,preference}, multi-session, temporal-reasoning, knowledge-update, plus abstention variant) | |
| LLM-as-judge accuracy via evaluate_qa.py | n/a — retrieval-only mode | ⚠️ deferred (no API key; the paper supports retrieval-only evaluation) |
| Session truncation | 500 chars per session (config bug, not a hardware limit) | no significant impact: QUICK-C re-run (v0.2.1_pgmnemo_20260509) recall@10 delta = 0.0008; addendum withdrawn |

Honest Findings

1. BM25 baseline outperforms pgmnemo vector on LongMemEval

recall@10:  pgmnemo vector (bge-m3) = 0.933  |  BM25 baseline = 0.982
recall@20:  pgmnemo vector (bge-m3) = 0.977  |  BM25 baseline = 0.996

Both systems were measured on the same dataset (longmemeval_s_cleaned, n=500); BM25 wins at both cutoffs.
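The two metrics quoted throughout this page can be sketched as follows (the data shapes here are hypothetical; the real computation lives in the scripts under benchmarks/scripts/): recall@K is the fraction of gold sessions found in the top-K retrievals, and MRR averages the reciprocal rank of the first gold hit over all questions.

```python
# Sketch of recall@K and MRR over ranked retrieval lists and gold session ids.
def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold items appearing in the top-k of the ranked list."""
    return len(set(ranked[:k]) & gold) / len(gold)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of 1/rank of the first gold hit (0.0 when no gold is retrieved)."""
    total = 0.0
    for ranked, gold in runs:
        rank = next((i for i, sid in enumerate(ranked, start=1) if sid in gold), 0)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs)

# Toy example: gold session "s3" retrieved at rank 2.
assert recall_at_k(["s1", "s3", "s7"], {"s3"}, k=2) == 1.0
assert mean_reciprocal_rank([(["s1", "s3", "s7"], {"s3"})]) == 0.5
```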

Hypothesized causes (under WG investigation):

- LongMemEval questions have high keyword overlap with relevant sessions — a BM25-friendly task
- pgmnemo’s 5-component scoring may over-penalize short queries
- the bge-m3 substitution (vs the paper-canonical Stella V5) may explain part of the gap
- session truncation had near-zero impact (QUICK-C delta = 0.0008)
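To make the keyword-overlap hypothesis concrete, here is a self-contained Okapi BM25 sketch (whitespace tokenizer and k1=1.5, b=0.75 are illustrative defaults, not the in-repo run_nollm.py implementation): a session that repeats the query's exact keywords outranks a paraphrase with none, which is precisely the regime where lexical retrieval shines.

```python
# Minimal Okapi BM25: rank documents by term-frequency-weighted IDF overlap
# with the query. Illustrative sketch, not the repo's BM25 baseline.
import math
from collections import Counter

def bm25(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    toks = [d.lower().split() for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n          # average document length
    df = Counter(w for t in toks for w in set(t))  # document frequencies
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if tf[w] == 0:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

# A session sharing the query's keywords outranks a paraphrase with none.
sessions = ["user asked about the marathon training plan",
            "we discussed preparing for a long race"]
scores = bm25("marathon training plan", sessions)
assert scores[0] > scores[1]
```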

2. pgmnemo wins on certain question types

| Q-type | pgmnemo recall@10 | Notes |
| --- | --- | --- |
| single-session-assistant | 0.982 | tied with BM25 |
| multi-session | 0.957 | strong vs BM25-only baselines |
| temporal-reasoning | 0.933 | competitive |
| knowledge-update | 0.923 | competitive |
| single-session-preference | 0.900 | competitive |
| single-session-user | 0.871 | weakest |
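When comparing these per-type proportions, the reports use Wilson score confidence intervals (Wilson 1927; see References). A minimal sketch, applied here to the overall recall@10 rather than to a per-type count (per-type n is not listed above):

```python
# Wilson score interval for a proportion p observed over n trials.
import math

def wilson_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% (z=1.96) Wilson interval: asymmetric, well-behaved near 0/1."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(0.933, 500)  # overall LongMemEval recall@10, n=500
assert lo < 0.933 < hi
```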

3. LoCoMo recall@10 = 0.366 (turn-level) is below paper-reported retrievers

Likely causes (under WG investigation):

- we index turn-level segments; the paper may use session-level
- the 5-component scoring weights need calibration on this dataset
- DRAGON 768d zero-padded to 1024d may have second-order HNSW effects (theoretically none, but worth verifying)
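The granularity hypothesis in miniature (hypothetical data shape, assuming a conversation is stored as {session_id: [(speaker, text), ...]}; the repo's actual loaders may differ): turn-level indexing yields one document per utterance, session-level one per session, and the same embedder can rank these very differently.

```python
# Sketch: emit indexable (segment_id, text) documents at two granularities.
def segments(conversation: dict, granularity: str):
    """Yield (segment_id, text) pairs at 'turn' or 'session' granularity."""
    for sid, turns in conversation.items():
        if granularity == "session":
            # one document per session: concatenated utterances
            yield sid, " ".join(f"{spk}: {txt}" for spk, txt in turns)
        else:
            # one document per utterance
            for i, (spk, txt) in enumerate(turns):
                yield f"{sid}#{i}", f"{spk}: {txt}"

conv = {"session_1": [("A", "I ran a marathon."), ("B", "Congrats!")]}
assert len(list(segments(conv, "turn"))) == 2
assert len(list(segments(conv, "session"))) == 1
```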


Reproducibility

# Full reproduction (run from the pgmnemo repo root):
docker run -d --name pgmnemo-bench -p 15432:5432 \
  -e POSTGRES_PASSWORD=bench -e POSTGRES_USER=bench -e POSTGRES_DB=bench \
  pgvector/pgvector:pg17

# Copy the source into the container (the build step expects /tmp/pgmnemo),
# then build and install the extension
docker cp . pgmnemo-bench:/tmp/pgmnemo
docker exec pgmnemo-bench bash -c "apt-get update -qq && \
  apt-get install -y -qq postgresql-server-dev-17 build-essential && \
  cd /tmp/pgmnemo && make && make install"
docker exec pgmnemo-bench psql -U bench -d bench -c "CREATE EXTENSION pgmnemo CASCADE;"

# LoCoMo (DRAGON, ~2 min on Apple Silicon MPS)
python benchmarks/scripts/run_locomo_bench.py

# LongMemEval (bge-m3, ~16 min on Apple Silicon MPS)
python benchmarks/scripts/run_longmemeval_pgmnemo.py

Hardware used for published numbers:

- Apple M-series Silicon (MPS GPU acceleration)
- Python 3.11.14, torch 2.11, transformers 5.8, sentence-transformers 5.4
- Wall clock: LoCoMo 111 s; LongMemEval 944 s


What’s Next (WG-in-progress)

WG goals:

1. Investigate why BM25 beats vector retrieval on LongMemEval
2. Identify scoring-formula tuning paths to close the gap
3. Reproduce the paper-canonical Stella V5 (transformers downgrade or an API-compat shim)
4. Compare against MAGMA (arxiv 2601.03236), Mem0, Zep, and HippoRAG on the same benchmarks
5. Roadmap: pgmnemo v0.2.2 (calibration) → v0.3.0 (multi-graph + dim-flex)


References

  • Maharana, A. et al. (2024). “Evaluating Very Long-Term Conversational Memory of LLM-based Agents.” ACL 2024. arXiv:2402.17753
  • Wu, D. et al. (2024). “LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory.” ICLR 2025. arXiv:2410.10813
  • Lin, S.-C. et al. (2023). “How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval.” HF: facebook/dragon-plus
  • BAAI/bge-m3: multilingual, MTEB-strong embedder (1024d). HF: BAAI/bge-m3
  • Wilson, E. B. (1927). “Probable Inference, the Law of Succession, and Statistical Inference.” Journal of the American Statistical Association. Used for score confidence intervals in the reports.