pgmnemo Benchmarks

Status: v0.2.1, first honest results (retrieval-only mode)

This document summarizes our public benchmark results, our methodology, and an honest comparison against published baselines.


TL;DR

| Benchmark | pgmnemo v0.2.1 | Notable comparison |
| --- | --- | --- |
| LoCoMo retrieval (DRAGON, n=1982) | recall@10 = 0.795, MRR = 0.548 (session-level) | paper-class range (DRAGON canonical, session granularity) |
| LongMemEval retrieval (bge-m3, n=500, s_cleaned) | recall@10 = 0.933, MRR = 0.855 | below in-repo BM25 baseline (0.982) |

Reports, raw_retrievals, and reproduction commands:

- benchmarks/locomo/results/v0.2.1_20260509/
- benchmarks/longmemeval/results/v0.2.1_20260509/ — BM25 baseline (run_nollm.py)
- benchmarks/longmemeval/results/v0.2.1_pgmnemo_20260509/ — pgmnemo vector (run_longmemeval_pgmnemo.py)


Methodology Conformance

LoCoMo (Maharana et al., ACL 2024)

| Paper requirement | Our implementation | Status |
| --- | --- | --- |
| Dataset | snap-research/locomo10.json (10 conversations, 1986 questions, 5 categories) | ✅ verbatim |
| Embedder | facebook/dragon-plus (context+query) | ✅ paper canonical |
| Retrieval k | k ∈ {5, 10, 25, 50} | ✅ all reported |
| Metric (primary retrieval) | recall@K | |
| MRR (secondary) | yes | |
| LLM-as-judge accuracy (downstream eval) | n/a — retrieval-only mode | ⚠️ deferred |
| Storage dim | 768d (DRAGON native) | ⚠️ DEVIATION: pgmnemo enforces vector(1024); we zero-pad 768→1024. Cosine similarity is preserved (mathematically identical). See ADDENDA/LOCOMO_EMBEDDER_PADDING.md. |
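Why the zero-padding deviation is math-identical, as a minimal sketch (illustrative code, not from the repo): the appended zeros contribute nothing to the dot product or to either norm, so cosine similarity over the padded 1024-d vectors equals cosine similarity over the original 768-d vectors.

```python
# Sketch: right-pad a 768-d DRAGON embedding with zeros to fit pgmnemo's
# vector(1024) column; cosine similarity is unchanged.
import math
import random

def pad(v: list[float], dim: int = 1024) -> list[float]:
    """Right-pad with zeros up to `dim` (pgmnemo enforces vector(1024))."""
    return v + [0.0] * (dim - len(v))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

rng = random.Random(0)
q = [rng.gauss(0, 1) for _ in range(768)]
d = [rng.gauss(0, 1) for _ in range(768)]
assert abs(cosine(q, d) - cosine(pad(q), pad(d))) < 1e-12
```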

LongMemEval (Wu et al., ICLR 2025)

| Paper requirement | Our implementation | Status |
| --- | --- | --- |
| Dataset | xiaowu0162/longmemeval-cleaned (longmemeval_s_cleaned.json, 500 questions × ~47.7 sessions/haystack) | |
| Embedder NovaSearch/stella_en_1.5B_v5 (1024d) | BAAI/bge-m3 (1024d) | ⚠️ DEVIATION: the bundled modeling_qwen.py is incompatible with transformers 5.8 (Qwen2Config.rope_theta AttributeError); we substituted BAAI/bge-m3 (1024d, MTEB-strong, matches common production use). See ADDENDA/LONGMEMEVAL_EMBEDDER_BGE_M3.md. |
| Retrieval (recall@K, NDCG@K, MRR) | recall@{1,5,10,20} + MRR | |
| Question types | 5 (single-session-{user,assistant,preference}, multi-session, temporal-reasoning, knowledge-update, plus abstention variant) | |
| LLM-as-judge accuracy via evaluate_qa.py | n/a — retrieval-only mode | ⚠️ deferred (no API key; the paper supports retrieval-only evaluation) |
| Session truncation | 500 chars per session (config bug, not a hardware limit) | no significant impact: QUICK-C re-run (v0.2.1_pgmnemo_20260509) recall@10 delta = 0.0008; addendum withdrawn |

Honest Findings

1. BM25 baseline outperforms pgmnemo vector on LongMemEval

recall@10:  pgmnemo vector (bge-m3) = 0.933  |  BM25 baseline = 0.982
recall@20:  pgmnemo vector (bge-m3) = 0.977  |  BM25 baseline = 0.996

Both systems were measured on the same dataset (longmemeval_s_cleaned, n=500); BM25 wins at both cutoffs.
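The two metrics quoted throughout this page can be sketched as follows (the data shapes here are hypothetical; the real computation lives in the scripts under benchmarks/scripts/): recall@K is the fraction of gold sessions found in the top-K retrievals, and MRR averages the reciprocal rank of the first gold hit over all questions.

```python
# Sketch of recall@K and MRR over ranked retrieval lists and gold session ids.
def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold items appearing in the top-k of the ranked list."""
    return len(set(ranked[:k]) & gold) / len(gold)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of 1/rank of the first gold hit (0.0 when no gold is retrieved)."""
    total = 0.0
    for ranked, gold in runs:
        rank = next((i for i, sid in enumerate(ranked, start=1) if sid in gold), 0)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs)

# Toy example: gold session "s3" retrieved at rank 2.
assert recall_at_k(["s1", "s3", "s7"], {"s3"}, k=2) == 1.0
assert mean_reciprocal_rank([(["s1", "s3", "s7"], {"s3"})]) == 0.5
```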

Hypothesized causes (under WG investigation):

- LongMemEval questions have high keyword overlap with relevant sessions — a BM25-friendly task
- pgmnemo’s 5-component scoring may over-penalize short queries
- the bge-m3 substitution (vs the paper-canonical Stella V5) may explain part of the gap
- session truncation had near-zero impact (QUICK-C delta = 0.0008)
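To make the keyword-overlap hypothesis concrete, here is a self-contained Okapi BM25 sketch (whitespace tokenizer and k1=1.5, b=0.75 are illustrative defaults, not the in-repo run_nollm.py implementation): a session that repeats the query's exact keywords outranks a paraphrase with none, which is precisely the regime where lexical retrieval shines.

```python
# Minimal Okapi BM25: rank documents by term-frequency-weighted IDF overlap
# with the query. Illustrative sketch, not the repo's BM25 baseline.
import math
from collections import Counter

def bm25(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    toks = [d.lower().split() for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n          # average document length
    df = Counter(w for t in toks for w in set(t))  # document frequencies
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if tf[w] == 0:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

# A session sharing the query's keywords outranks a paraphrase with none.
sessions = ["user asked about the marathon training plan",
            "we discussed preparing for a long race"]
scores = bm25("marathon training plan", sessions)
assert scores[0] > scores[1]
```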

2. pgmnemo wins on certain question types

| Q-type | pgmnemo recall@10 | Notes |
| --- | --- | --- |
| single-session-assistant | 0.982 | tied with BM25 |
| multi-session | 0.957 | strong vs BM25-only baselines |
| temporal-reasoning | 0.933 | competitive |
| knowledge-update | 0.923 | competitive |
| single-session-preference | 0.900 | competitive |
| single-session-user | 0.871 | weakest |
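When comparing these per-type proportions, the reports use Wilson score confidence intervals (Wilson 1927; see References). A minimal sketch, applied here to the overall recall@10 rather than to a per-type count (per-type n is not listed above):

```python
# Wilson score interval for a proportion p observed over n trials.
import math

def wilson_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% (z=1.96) Wilson interval: asymmetric, well-behaved near 0/1."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(0.933, 500)  # overall LongMemEval recall@10, n=500
assert lo < 0.933 < hi
```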

3. LoCoMo recall@10 = 0.366 (turn-level) is below paper-reported retrievers

Likely causes (under WG investigation):

- we index turn-level segments; the paper may use session-level
- the 5-component scoring weights need calibration on this dataset
- DRAGON 768d zero-padded to 1024d may have second-order HNSW effects (theoretically none, but worth verifying)
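The granularity hypothesis in miniature (hypothetical data shape, assuming a conversation is stored as {session_id: [(speaker, text), ...]}; the repo's actual loaders may differ): turn-level indexing yields one document per utterance, session-level one per session, and the same embedder can rank these very differently.

```python
# Sketch: emit indexable (segment_id, text) documents at two granularities.
def segments(conversation: dict, granularity: str):
    """Yield (segment_id, text) pairs at 'turn' or 'session' granularity."""
    for sid, turns in conversation.items():
        if granularity == "session":
            # one document per session: concatenated utterances
            yield sid, " ".join(f"{spk}: {txt}" for spk, txt in turns)
        else:
            # one document per utterance
            for i, (spk, txt) in enumerate(turns):
                yield f"{sid}#{i}", f"{spk}: {txt}"

conv = {"session_1": [("A", "I ran a marathon."), ("B", "Congrats!")]}
assert len(list(segments(conv, "turn"))) == 2
assert len(list(segments(conv, "session"))) == 1
```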


Reproducibility

# Full reproduction (run from the pgmnemo repo root):
docker run -d --name pgmnemo-bench -p 15432:5432 \
  -e POSTGRES_PASSWORD=bench -e POSTGRES_USER=bench -e POSTGRES_DB=bench \
  pgvector/pgvector:pg17

# Copy the source into the container (the build step expects /tmp/pgmnemo),
# then build and install the extension
docker cp . pgmnemo-bench:/tmp/pgmnemo
docker exec pgmnemo-bench bash -c "apt-get update -qq && \
  apt-get install -y -qq postgresql-server-dev-17 build-essential && \
  cd /tmp/pgmnemo && make && make install"
docker exec pgmnemo-bench psql -U bench -d bench -c "CREATE EXTENSION pgmnemo CASCADE;"

# LoCoMo (DRAGON, ~2 min on Apple Silicon MPS)
python benchmarks/scripts/run_locomo_bench.py

# LongMemEval (bge-m3, ~16 min on Apple Silicon MPS)
python benchmarks/scripts/run_longmemeval_pgmnemo.py

Hardware used for published numbers:

- Apple M-series Silicon (MPS GPU acceleration)
- Python 3.11.14, torch 2.11, transformers 5.8, sentence-transformers 5.4
- Wall clock: LoCoMo 111 s; LongMemEval 944 s


What’s Next (WG-in-progress)

WG goals:

1. Investigate why BM25 beats vector retrieval on LongMemEval
2. Identify scoring-formula tuning paths to close the gap
3. Reproduce the paper-canonical Stella V5 (transformers downgrade or an API-compat shim)
4. Compare against MAGMA (arxiv 2601.03236), Mem0, Zep, and HippoRAG on the same benchmarks
5. Roadmap: pgmnemo v0.2.2 (calibration) → v0.3.0 (multi-graph + dim-flex)


References

  • Maharana, A. et al. (2024). “Evaluating Very Long-Term Conversational Memory of LLM-based Agents.” ACL 2024. arXiv:2402.17753
  • Wu, D. et al. (2024). “LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory.” ICLR 2025. arXiv:2410.10813
  • Lin, S.-C. et al. (2023). “How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval.” HF: facebook/dragon-plus
  • BAAI/bge-m3: multilingual, MTEB-strong embedder (1024d). HF: BAAI/bge-m3
  • Wilson, E. B. (1927). “Probable Inference, the Law of Succession, and Statistical Inference.” Journal of the American Statistical Association. Used for score confidence intervals in the reports.