Overflow Prevention Enhances Long-Context Recurrent LLMs (COLM 2025)

Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, James Glass, Leonid Karlinsky, Raja Giryes

We present OPRM (Overflow Prevention for Recurrent Models), a training-free inference method for long-context recurrent LLMs. By mitigating recurrent memory overflows, OPRM ensures reliable inference, leading to significant gains in both synthetic and real-world long-context tasks. In addition, OPRM naturally performs context extension, allowing the model to handle sequences far longer than those it was originally trained on, all while being faster than vanilla inference and requiring a surprisingly small memory footprint.

LongBench results of leading recurrent LLMs, with and without OPRM inference:

Associative Recall results for Falcon-Mamba-Inst-7B, with and without OPRM inference:

In addition, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies across multiple chunks, since our single-chunk strategy delivers stronger performance - even in tasks that presumably require cross-segment relations.
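The chunk-selection idea at the heart of OPRM can be sketched with a toy example. The code below is our illustration, not the repo's implementation: it assumes the prompt has already been split into chunks, that each chunk (plus the query) has been run through the model, and that chunks are scored by the entropy of their first answer-token distribution, keeping the most confident one.

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_chunk(chunk_distributions):
    """Return the index of the chunk whose first answer-token
    distribution is the most confident (lowest entropy)."""
    return min(range(len(chunk_distributions)),
               key=lambda i: entropy(chunk_distributions[i]))

# Toy distributions over a 3-token vocabulary, one per context chunk:
dists = [
    [0.4, 0.30, 0.30],  # chunk 0: spread out (uncertain)
    [0.9, 0.05, 0.05],  # chunk 1: peaked (confident)
    [0.5, 0.25, 0.25],  # chunk 2
]
print(select_chunk(dists))  # -> 1
```

Decoding then continues only from the selected chunk's recurrent state, which is why memory use stays small regardless of total context length.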


Release Updates

  • [13/5/2025] Code published!

Setup

Clone Project

git clone https://github.com/assafbk/OPRM.git
cd OPRM

Set up custom Transformers library:

git submodule init
git submodule update

Create Environment

To set up our environment, please run:

conda env create -f environment.yml
conda activate oprm

Install custom Transformers library:

cd submodules/transformers
pip install -e .

Install for faster inference with the Falcon-Mamba models:

pip install "causal-conv1d>=1.4.0"
pip install mamba-ssm

Install for faster inference with the RWKV model:

pip install --no-deps git+https://github.com/sustcsonglin/flash-linear-attention

Evaluate OPRM on LongBench

Generate LongBench Predictions

python eval_longbench_oprm.py --device <cuda_device> --model <model_type> --e <e> --is_oprm <is_oprm>

Arguments:

  • <cuda_device> - str, of the form 'cuda:x', where x is the GPU id
  • <model_type> - str, currently supported models: 'falcon_mamba', 'falcon3_mamba', 'recurrent_gemma', 'rwkv'
  • <e> - int, 0 for LongBench, 1 for LongBench_e
  • <is_oprm> - int, 0 for vanilla inference, 1 for OPRM
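For example, to evaluate Falcon-Mamba with OPRM inference on the full LongBench using GPU 0 (the device and model choices here are illustrative):

```shell
python eval_longbench_oprm.py --device cuda:0 --model falcon_mamba --e 0 --is_oprm 1
```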

Additional Configurations (set directly in the code of eval_longbench_oprm.py):

  • <cache_dir> - str, HuggingFace cache dir
  • <out_path_base> - str, base path for model predictions
  • <max_len_per_seg> - int, maximum number of tokens allowed in a batch. Useful for very long sequences, when not all context chunks fit in a single batch.
  • <chunk_sizes_to_test> - list, chunk sizes to test (L in the paper).
  • <datasets_to_test> - list, datasets to evaluate. Select a subset of: ["hotpotqa", "2wikimqa", "musique", "narrativeqa", "qasper", "multifieldqa_en", "gov_report", "qmsum", "multi_news", "trec", "triviaqa", "samsum", "passage_count", "passage_retrieval_en", "lcc", "repobench-p"]
  • <dataset_ntoks> - dict, maps each dataset to the maximum number of tokens the model may generate per query.
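As an illustration, the <dataset_ntoks> mapping might look like the snippet below. The values follow LongBench's standard per-dataset generation lengths and are assumptions for illustration; the authoritative values are the ones already set in eval_longbench_oprm.py.

```python
# Illustrative values only -- the script ships with its own mapping.
dataset_ntoks = {
    "narrativeqa": 128,
    "qasper": 128,
    "gov_report": 512,   # summarization tasks need longer outputs
    "trec": 64,          # classification: a short label suffices
    "lcc": 64,           # code completion: roughly one line
}
```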

Evaluate Predictions

cd submodules/LongBench/LongBench

If adding a new model:

mkdir -p pred/<model_type>
mkdir -p pred_e/<model_type>

Copy the predictions file into pred/<model_type> (or pred_e/<model_type> if evaluating LongBench_e) and run:

python eval.py --model <model_type>

Add the --e flag if evaluating LongBench_e.
The results should be in the results.json file, in the same dir as the copied predictions file.

Acknowledgments

This work was possible thanks to the Mamba, LongBench, and Transformers libraries. We would like to thank them for their great work!

Citation

If you find this work useful, please cite the following:

@misc{benkish2025overflowpreventionenhanceslongcontext,
      title={Overflow Prevention Enhances Long-Context Recurrent LLMs}, 
      author={Assaf Ben-Kish and Itamar Zimerman and M. Jehanzeb Mirza and James Glass and Leonid Karlinsky and Raja Giryes},
      year={2025},
      eprint={2505.07793},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.07793}, 
}
