VEKTOR v1.7.2 + LongMemEval 79.0% — Benchmark Results

LongMemEval: 79.0%

We ran VEKTOR Slipstream v1.7.2 against LongMemEval this week.

System               Score
VEKTOR v1.7.2        79.0%
Full-context GPT-4   67.0%
Mem0                 62.0%
ReadAgent            55.0%
MemGPT               49.0%

This run covers 105 LongMemEval questions drawn from real multi-session conversations, averaging 344 stored memories per question. Full-context GPT-4 scores 67.0%; VEKTOR beats it by 12 percentage points while running on local SQLite with no cloud dependency.

What drove the result — routed ingest:

Memory types are routed differently at write time. Temporal facts, multi-session memories, and knowledge updates go through LLM extraction, in which gpt-4o-mini extracts discrete factual statements with resolved dates. Single-session turns are stored as raw text.
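As a rough illustration of the routing idea, here is a minimal Python sketch. It is not VEKTOR's code: the names MemoryType, StoredMemory, ingest and stub_extractor are hypothetical, the memory type is assumed to be known already when a turn arrives, and the LLM call is replaced by a trivial sentence splitter so the example runs offline.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class MemoryType(Enum):
    # Hypothetical labels; the post does not say how types are detected.
    TEMPORAL = "temporal"
    MULTI_SESSION = "multi_session"
    KNOWLEDGE_UPDATE = "knowledge_update"
    SINGLE_SESSION = "single_session"


# Types the post says go through LLM extraction at write time.
EXTRACTED_TYPES = {
    MemoryType.TEMPORAL,
    MemoryType.MULTI_SESSION,
    MemoryType.KNOWLEDGE_UPDATE,
}


@dataclass
class StoredMemory:
    text: str
    extracted: bool  # True if the text came from the extraction path


def stub_extractor(turn: str) -> List[str]:
    """Stand-in for the LLM extraction step (assumption, not the real call).

    It only splits on periods; a real extractor would return discrete
    factual statements with resolved dates.
    """
    return [part.strip() for part in turn.split(".") if part.strip()]


def ingest(
    turn: str,
    memory_type: MemoryType,
    extractor: Callable[[str], List[str]] = stub_extractor,
) -> List[StoredMemory]:
    """Route a turn: extract facts for some types, store raw text otherwise."""
    if memory_type in EXTRACTED_TYPES:
        return [StoredMemory(fact, True) for fact in extractor(turn)]
    return [StoredMemory(turn, False)]


facts = ingest("I moved to Oslo. I started a new job.", MemoryType.TEMPORAL)
raw = ingest("Hello there. How are you?", MemoryType.SINGLE_SESSION)
```

In this sketch the temporal turn becomes two separate extracted records, while the single-session turn is kept as one untouched record. Passing the extractor in as a parameter is only a convenience for swapping the stub for a real model call.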

By category:

  • Temporal reasoning: 100% (15/15)
  • Abstention: 90% (9/10)
  • Single-session-assistant: 86.7% (13/15)
  • Single-session-user: 80.0% (16/20)
  • Multi-session: 75.0% (15/20) — up 30 points from v3
  • Knowledge-update: 66.7% (10/15) — needs work
  • Single-session-preference: 50.0% (5/10) — needs work
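
The category counts above can be checked against the headline number. The short Python snippet below sums the correct and total counts exactly as listed; the dictionary and function names are illustrative only.

```python
# (correct, total) per category, copied from the list above.
CATEGORY_RESULTS = {
    "Temporal reasoning": (15, 15),
    "Abstention": (9, 10),
    "Single-session-assistant": (13, 15),
    "Single-session-user": (16, 20),
    "Multi-session": (15, 20),
    "Knowledge-update": (10, 15),
    "Single-session-preference": (5, 10),
}


def overall_accuracy(results):
    """Return (correct, total, percent) pooled across all categories."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct, total, 100.0 * correct / total


correct, total, pct = overall_accuracy(CATEGORY_RESULTS)
print(f"{correct}/{total} = {pct:.1f}%")  # prints "83/105 = 79.0%"
```

The pooled figure is a question-weighted average, so the 20-question categories move the headline score twice as much as the 10-question ones.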

The target for v5 is 85%, to be reached through supersession-model fixes for knowledge updates and a dedicated preference-recall pathway.
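
To put the 85% target in question counts, here is a small Python calculation. It assumes the same 105-question set is used again and takes the current total of 83 correct from the category counts listed earlier; the variable names are illustrative.

```python
import math

TOTAL_QUESTIONS = 105   # size of this run
CURRENT_CORRECT = 83    # sum of the per-category correct counts
TARGET_PERCENT = 85

# Smallest number of correct answers that reaches the target percentage.
needed_correct = math.ceil(TARGET_PERCENT * TOTAL_QUESTIONS / 100)
additional = needed_correct - CURRENT_CORRECT

# Questions currently missed in the two categories marked "needs work".
missed_knowledge_update = 15 - 10
missed_preference = 10 - 5
recoverable = missed_knowledge_update + missed_preference

print(needed_correct, additional, recoverable)  # prints "90 7 10"
```

Under that assumption, reaching 85% means recovering 7 of the 10 questions currently missed in the knowledge-update and preference categories, with no regressions elsewhere.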

Full writeup: https://medium.com/@vektormemory