LongMemEval: 79.0%
We ran VEKTOR Slipstream v1.7.2 against LongMemEval this week.
| System | Score |
|---|---|
| VEKTOR v1.7.2 | 79.0% |
| Full-context GPT-4 | 67.0% |
| Mem0 | 62.0% |
| ReadAgent | 55.0% |
| MemGPT | 49.0% |
LongMemEval tests 105 questions across real multi-session conversations, averaging 344 stored memories per question. Full-context GPT-4 scores 67%; VEKTOR beats it by 12 points while running on local SQLite with no cloud dependency.
What drove the result: routed ingest.
Different memory types get different treatment at write time. Temporal facts, multi-session memories, and knowledge updates go through LLM extraction (gpt-4o-mini extracts discrete factual statements with resolved dates). Single-session turns go in as raw text.
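As a rough illustration of the routing idea, here is a minimal sketch, not VEKTOR's actual code. The names `route_ingest`, `MemoryRecord`, `EXTRACT_TYPES`, and `fake_extractor` are hypothetical, and the extractor is a plain callable standing in for the gpt-4o-mini call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical labels for the memory types that go through LLM extraction.
EXTRACT_TYPES = {"temporal", "multi-session", "knowledge-update"}


@dataclass
class MemoryRecord:
    text: str
    route: str  # "extracted" or "raw"


def route_ingest(
    memory_type: str,
    turn_text: str,
    extract_facts: Callable[[str], List[str]],
) -> List[MemoryRecord]:
    """Decide at write time how a turn is stored.

    Types in EXTRACT_TYPES are split into discrete factual statements by
    the supplied extractor; everything else is stored as raw text.
    """
    if memory_type in EXTRACT_TYPES:
        return [MemoryRecord(fact, "extracted") for fact in extract_facts(turn_text)]
    return [MemoryRecord(turn_text, "raw")]


def fake_extractor(text: str) -> List[str]:
    # Stand-in for the LLM call (assumption): one "fact" per sentence.
    return [s.strip() for s in text.split(".") if s.strip()]


extracted = route_ingest("temporal", "I moved to Oslo. I started a new job.", fake_extractor)
raw = route_ingest("single-session", "Thanks. That helps.", fake_extractor)
```

In this sketch the temporal turn becomes two extracted records, while the single-session turn is stored as one raw record. A real extractor would also need to resolve relative dates, which the stand-in does not attempt.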
By category:
- Temporal reasoning: 100% (15/15)
- Abstention: 90% (9/10)
- Single-session-assistant: 86.7% (13/15)
- Single-session-user: 80.0% (16/20)
- Multi-session: 75.0% (15/20) — up 30 points from v3
- Knowledge-update: 66.7% (10/15) — needs work
- Single-session-preference: 50.0% (5/10) — needs work
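The per-category counts above can be summed to check the headline number. The sketch below uses only the counts listed in this post; the dictionary and function names are illustrative.

```python
# (correct, total) per category, copied from the list above.
CATEGORY_RESULTS = {
    "temporal-reasoning": (15, 15),
    "abstention": (9, 10),
    "single-session-assistant": (13, 15),
    "single-session-user": (16, 20),
    "multi-session": (15, 20),
    "knowledge-update": (10, 15),
    "single-session-preference": (5, 10),
}


def overall_accuracy(results):
    """Return (correct, total, percent) pooled over all categories."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct, total, 100.0 * correct / total


correct, total, pct = overall_accuracy(CATEGORY_RESULTS)
print(f"{correct}/{total} = {pct:.1f}%")  # prints "83/105 = 79.0%"
```

The pooled score weights each question equally, so the 20-question categories count for twice as much as the 10-question ones.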
The target for v5 is 85%, with supersession-model fixes for knowledge updates and a dedicated preference-recall pathway.
Full writeup: https://medium.com/@vektormemory