MemGym

a Long-Horizon Memory Environment for LLM Agents

TL;DR — A lightweight, long-horizon evaluation gym for LLM-agent memory across coding, tool-use, computer-use, and deep-research scenarios.

Why MemGym?

Prior memory benchmarks test recall of personal facts in multi-turn chat. MemGym instead measures the memory an agent forms while it works — across coding, tool use, computer use, and deep research — where memory actually decides whether a task succeeds.

MemGym overview: five evaluation tracks across four agentic regimes, all sharing a single memory-reasoning interface that enables memory-isolated scoring.
Figure 1. Five tracks across four agentic regimes, behind one shared memory interface.

What MemGym tests

Five evaluation tracks spanning four agentic regimes. Each track pairs the best memory strategy against a same-reasoner, no-memory run, so the headline number is memory-isolated gain — not absolute task difficulty.

How it works

Every environment plugs into one memory module that wraps the prompt to the policy LLM, so a memory strategy is written once and runs everywhere — and is scored in isolation from reasoning and tool use. Recorded trajectories train MemRM, a lightweight reward model that judges whether a compression preserves the agent's behavior without re-running the agent.

MemGym architecture: five environments share a memory module that wraps the prompt to the policy LLM. Trajectories feed a replay-augmentation pipeline producing SAFE/HARMFUL labels for MemRM.
Figure 2. One memory module wraps every environment behind a single compress / recall / write contract — so the policy stays unchanged and memory is scored in isolation.

Fast path vs. full rollout. Measuring memory gain the full way takes two paired rollouts per instance — on SWE-Gym that is ≈110 turns and ≈$4.20 of Sonnet 4.5, and a 5-strategy × 3-seed sweep runs $6,300. MemRM replaces each ~10-minute Docker rollout with a sub-second scalar read at 0.985 AUROC.

Leaderboard

Results across all five evaluation tracks. Each row is paired with a same-reasoner, no-memory baseline (italic, pinned to the bottom of its group) so the Δ column reports memory-isolated gain, not absolute task performance. This page reads data/leaderboard.json at load time — appending a row to that file is all it takes to publish a new result.

Citation

If you use MemGym in your work, please cite:

@misc{xu2026memgym,
  title         = {MemGym: a Long-Horizon Memory Environment for LLM Agents},
  author        = {Xu, Wujiang and Wang, Yu and Mei, Kai and Liang, Kaiqu and Wang, Zhenting and Jin, Mingyu and Zhang, Han and Zhang, Shi-Xiong and Hua, Wenyue and Sahu, Sambit and Metaxas, Dimitris N.},
  year          = {2026},
  eprint        = {2605.20833},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}

Cite MemGym

@misc{xu2026memgym,
  title         = {MemGym: a Long-Horizon Memory Environment for LLM Agents},
  author        = {Xu, Wujiang and Wang, Yu and Mei, Kai and Liang, Kaiqu and Wang, Zhenting and Jin, Mingyu and Zhang, Han and Zhang, Shi-Xiong and Hua, Wenyue and Sahu, Sambit and Metaxas, Dimitris N.},
  year          = {2026},
  eprint        = {2605.20833},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}