cleanome

Cleanome is a Python-based helper toolkit designed to standardize and clean genome annotation files (especially GTFs) so they can be consumed by genomics reference builders like Cellranger (and Cellranger-ARC). At a high level, running a genome through Cleanome involves the following modifications and preprocessing steps:

Downloading core files: It fetches the FASTA sequence, GTF annotation, and associated assembly metadata for a list of species you provide (by scientific name or taxonomy ID), centralizing them in a working directory.
Diagnostic statistics: It computes basic assembly and annotation statistics across all genomes (e.g., contig sizes, number of genes/transcripts), producing a summary table used later for conditional corrections.
GTF (debugging):
- Adds missing gene/transcript IDs
- Deduplicates features
- Fixes structural nesting issues in GTF entries (improper exon/transcript hierarchies), ensuring the overall annotation tree is coherent.
- Splits overly large contigs when necessary (based on the genome statistics), accommodating tools that have contig size limits. The only mammalian species that I’ve seen need this is Monodelphis domestica.
Shell script generation for references: After cleaning, Cleanome can also write out shell scripts that will invoke the appropriate reference builder binary (e.g., the cellranger-arc CLI) with the cleaned GTF and accompanying FASTA, ready for indexing and full reference creation.

Installation

Make a anaconda environment with python>=3.6

conda install -c conda-forge ncbi-datasets-cli

conda install -c conda-forge -c bioconda ete3 gtfparse numpy pandas polars polars-lts-cpu pyarrow requests biopython tqdm 

git clone git@github.com:mtvector/cleanome.git
cd cleanome
pip install .

Usage

To debug a gtf by adding missing gene and transcript fields, replacing missing gene fields with the gene_id, and other common issues in NCBI genome annotations:

debug_gtf file.gtf file.debug.gtf

See example_run.sh for an example script utilizing the full pipeline to download genomes, get statistics, debug gtfs and build cellranger-arc references.

Current pipeline functions include:

download_genomes --species_list ./species.txt --genome_dir ./genomes/

Download genomes from NCBI for all the species on the list. A utility for downloading ENSEMBL genomes for the list is included (have to write your own download_genomes.py for now)

get_genomes_and_stats --genome_dir ./genomes/ -o ./genome_info.csv -c

Collects the fastas and gtfs from all the genomes in a directory and calculates some simple statistics.

make_cellranger_arc_sh --sh_scripts_dir ./submission_scripts/ --stats_csv ./genome_info.csv --output_dir ~/cellranger-arc --log_dir ~/log/ -cellranger_bin /path/to/cellranger-arc/bin/

Debugs gtfs (deduplicate transcripts/genes, add missing transcripts for exons and missing genes for transcripts, fills missing values with placeholders, split chromosomes that are too large, fix nesting)

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
cleanome		cleanome
.gitignore		.gitignore
README.md		README.md
example_run.sh		example_run.sh
pyproject.toml		pyproject.toml
species_list.txt		species_list.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cleanome

Installation

Usage

About

Uh oh!

Releases

Packages

Uh oh!

Languages

mtvector/cleanome

Folders and files

Latest commit

History

Repository files navigation

cleanome

Installation

Usage

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages