This repository contains the code for the seqLens manuscript. In the seqLens project, we investigated genomic language models by implementing a set of DNA language models and benchmarking their efficiency.
Hugging Face models: omicseye/seqLens
seqLens manuscript preprint: https://doi.org/10.1101/2025.03.12.642848
Our research introduces several key innovations in DNA sequence modeling. We gathered two different pre-training datasets consisting of:
- 19,551 reference genomes, including more than 18,000 prokaryotic genomes (over 115B nucleotides)
- A more balanced dataset of 1,355 prokaryotic and eukaryotic reference genomes (over 180B nucleotides)
We trained five different byte-pair encoding (BPE) tokenizers and pre-trained 52 DNA language models. We introduce seqLens models, which are based on disentangled attention with relative positional encoding and outperform state-of-the-art models in 13 out of 19 benchmarking tasks.
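As a toy illustration of how byte-pair encoding builds a vocabulary (this is not the project's tokenizer-training code, which learns a 4,096-entry vocabulary from genome-scale corpora), a minimal merge loop over a single DNA string can be sketched as:

```python
from collections import Counter

def learn_bpe_merges(sequence, num_merges):
    """Learn BPE merge rules from one DNA string (toy example)."""
    tokens = list(sequence)  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):
            # Replace every occurrence of the best pair with a merged token
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = learn_bpe_merges("ATGATGATGCCGATG", num_merges=3)
```

Each merge promotes the most frequent adjacent pair to a new vocabulary entry, so recurring motifs end up as single tokens.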
Additionally, we explored:
- Domain-specific pre-training
- Token representation strategies
- Fine-tuning methods
Our findings show that:
- Using relevant pre-training data significantly boosts performance
- Alternative pooling techniques can enhance classification
- Full fine-tuning and parameter-efficient fine-tuning trade off accuracy against efficiency
These insights provide a foundation for optimizing model design and training in biological research.
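To make the full versus parameter-efficient trade-off concrete, here is a generic LoRA-style sketch (not the fine-tuning code used in the manuscript): the pre-trained weights are frozen and only a small low-rank update is trained, which cuts the trainable parameter count dramatically at some potential cost in accuracy.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (LoRA-style sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op update

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

base = nn.Linear(768, 768)          # stand-in for one pre-trained layer
adapted = LowRankAdapter(base, rank=8)

full_params = sum(p.numel() for p in base.parameters())
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
```

Here a 590K-parameter layer needs only ~12K trainable parameters under the adapter, which is why parameter-efficient methods are cheaper per step but may lag full fine-tuning in accuracy.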
In this repository, we include code for the following tasks:
- Pre-training
- Benchmarking
- Visualization
- Different pooling techniques for classification
- Vector representations for DNA sequences
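As a sketch of two common pooling strategies for turning token-level hidden states into a sequence-level vector (variable names here are illustrative, not from the repository), CLS pooling takes the first token's representation while mean pooling averages over non-padding tokens:

```python
import torch

# Simulated encoder output: (batch, seq_len, hidden) last hidden states
hidden = torch.randn(2, 10, 64)
mask = torch.ones(2, 10)   # attention mask; 1 = real token, 0 = padding
mask[1, 6:] = 0            # second sequence has 4 padding positions

# CLS pooling: use the first token's hidden state as the sequence vector
cls_pooled = hidden[:, 0]

# Mean pooling: average hidden states over non-padding tokens only
m = mask.unsqueeze(-1)
mean_pooled = (hidden * m).sum(dim=1) / m.sum(dim=1)
```

Masking before averaging matters: without it, padding positions would dilute the mean for shorter sequences.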
The following visualization shows how different fine-tuning methods affect model performance when using vector representations:
You can use these models for your research or use the provided scripts to train your models.
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("omicseye/seqLens_4096_512_89M-at-base")
model = AutoModelForMaskedLM.from_pretrained("omicseye/seqLens_4096_512_89M-at-base")
```

All seqLens models were trained on NVIDIA Tesla V100 GPUs using distributed training via Hugging Face Accelerate.
| Model | Params | Vocab | Context | # GPUs | Batch / GPU | Effective Batch | Runtime (h) | GPU-hours | Avg TFLOPs |
|---|---|---|---|---|---|---|---|---|---|
| seqLens_4096_512_15M | 15M | 4096 | 512 | 2 | 128 | 256 | 43.97 | 87.9 | 15.56 |
| seqLens_4096_512_47M-at | 47M | 4096 | 512 | 2 | 64 | 128 | 50.11 | 100.2 | 13.76 |
| seqLens_4096_512_89M-at-base | 89M | 4096 | 512 | 4 | 16 | 64 | 167.98 | 671.9 | 5.55 |
| seqLens_4096_512_89M-at-base-multi | 89M | 4096 | 512 | 1 | 64 | 64 | 507.61 | 507.6 | 6.02 |
GPU-hours = runtime × number of GPUs
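A quick sanity check of this relation against the table (pure arithmetic; runtimes and GPU counts are taken from the rows above):

```python
def gpu_hours(runtime_h, n_gpus):
    """GPU-hours = wall-clock runtime (hours) x number of GPUs."""
    return runtime_h * n_gpus

# (runtime in hours, number of GPUs) from the training table
runs = {
    "seqLens_4096_512_15M": (43.97, 2),
    "seqLens_4096_512_89M-at-base": (167.98, 4),
    "seqLens_4096_512_89M-at-base-multi": (507.61, 1),
}
totals = {name: gpu_hours(*cfg) for name, cfg in runs.items()}
```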
| # GPUs Used | Number of Runs | Typical Models |
|---|---|---|
| 1 GPU | 1 run | 89M-at-base-multi |
| 2 GPUs | Majority of runs | 15M, 23M, 46M, 47M |
| 4 GPUs | Several runs | 23M-at-xsmall, 47M-at-small, 89M-at-base |
- Most experiments were conducted using 2× V100 GPUs
- Larger models (≥ 89M parameters) benefit from 4 GPUs
- Single-GPU training is feasible but increases wall-clock time
- TFLOPs decrease for larger models due to memory and communication overhead
- Context length: 512 tokens (~2.8–3kb DNA sequence context)
- Vocabulary size: 4096 (BPE tokenizer)
- Mixed precision enabled
- Gradient checkpointing enabled for larger models
- Typical training schedule: up to 150K steps
- Runtime and TFLOPs logged per iteration
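The mixed-precision setting above can be sketched as a single PyTorch autocast training step. This is a generic illustration, not the repository's training loop; it uses bfloat16 on CPU for portability, whereas fp16 autocast with a gradient scaler would be the typical choice on V100 GPUs.

```python
import torch
import torch.nn as nn

# Tiny stand-in model; the real runs train seqLens transformers on V100s
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 32)
y = torch.randint(0, 4, (8,))

# One mixed-precision training step: the forward pass and loss run in
# reduced precision inside autocast; backward/step run outside it
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()
```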
| Model Size | Recommended Hardware | Expected Runtime |
|---|---|---|
| 15M–47M | 2× V100 GPUs | ~40–60 hours |
| 89M | 4× V100 GPUs | ~160–170 hours |
| 89M (single GPU) | 1× V100 GPU | ~500 hours |
- GPU: NVIDIA Tesla V100
- Framework: PyTorch
- Distributed Training: HuggingFace Accelerate
- Logging: TFLOPs, runtime, and iteration time tracked per step
If you use seqLens in your research, please cite our work:
```bibtex
@article{seqLens,
  author = {Baghbanzadeh, Mahdi and Mann, Brendan and Crandall, Keith A and Rahnavard, Ali},
  title = {seqLens: optimizing language models for genomic predictions},
  elocation-id = {2025.03.12.642848},
  year = {2025},
  doi = {10.1101/2025.03.12.642848},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848},
  eprint = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848.full.pdf},
  journal = {bioRxiv}
}
```
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0) License.
Commercial use of this software or any related models may require a separate licensing agreement due to a pending patent.
For commercial inquiries, please contact Ali Rahnavard at rahnavard@gwu.edu.
For any questions, feel free to email or open an issue in this repository.



