# LaDDR: Latent Data-Driven RNA phenotyping
```shell
cd LaDDR
pip install -e .
```

Run this command to generate a project directory `myproject` with the basic config and two template options for running the pipeline:
```shell
laddr init myproject
cd myproject
```

```
Project initialized at myproject
Files created:
myproject/
    config.yaml: Edit this file to configure your project
    run.sh: Shell script for running the project
    Snakefile: Snakemake file for running the project
```
Either `run.sh` or `Snakefile` can be modified as needed to run the project. We recommend adapting the `Snakefile` for your project and using the Snakemake workflow system.
Aside from command line arguments for the dataset and/or batch for which to run a command, all parameters, file paths, etc. are specified in the `config.yaml` file, since they are often reused across commands. The default config contains parameters for the recommended binning and model fitting methods. To start with a config file containing all method options and parameters, use `laddr init myproject --config-type extended`.
`examples/` contains two project directories set up to run example projects on two small datasets, using either run method. If you install the package via pip and do not have `examples/`, you can generate its two subdirectories with these commands:
```shell
laddr init shell --config-type example --template shell
laddr init snakemake --config-type example --template snakemake
```

This tool uses base-level RNA-seq coverage in bigWig format. For BAM files, first convert them to bigWig using `bamCoverage` from deepTools:

```shell
bamCoverage -b /path/to/bams/sampleA.bam -o covg_bigwig/sampleA.bw -of bigwig --binSize 1 -p 4
```

Then, make a coverage manifest file listing the dataset, sample ID, and path for every bigWig file, e.g.:
```
tissue1 S000 ../data_input/covg_bigwig/tissue1/S000.bw
tissue1 S001 ../data_input/covg_bigwig/tissue1/S001.bw
...
tissue8 S098 ../data_input/covg_bigwig/tissue8/S098.bw
tissue8 S099 ../data_input/covg_bigwig/tissue8/S099.bw
```
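If you have many samples, the manifest can be generated with a short script. Below is a minimal sketch, assuming bigWigs are organized as `<root>/<dataset>/<sample>.bw`; the `write_manifest` helper and its layout assumption are illustrative, not part of the package:

```python
from pathlib import Path

def write_manifest(bigwig_root, out_path):
    """Write a manifest of dataset, sample ID, and bigWig path.

    Assumes bigWigs are organized as <bigwig_root>/<dataset>/<sample>.bw
    (an illustrative convention; adjust the glob to your layout).
    """
    lines = []
    for bw in sorted(Path(bigwig_root).glob("*/*.bw")):
        dataset = bw.parent.name   # e.g. "tissue1"
        sample = bw.stem           # e.g. "S000"
        lines.append(f"{dataset}\t{sample}\t{bw}")
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines
```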
A latent phenotype table will be produced for each dataset, containing phenotypes for its samples. Multiple datasets are supported so that their coverage data can be loaded together and the latent phenotype models fit jointly: the models are then trained on biologically diverse data, and the phenotypes correspond across datasets.
This method involves five main steps, plus a sixth optional analysis step, as outlined below. There are also options for using existing gene bins and models:
If existing gene bins are available for your desired species and gene annotations:

- Download the `info/` and `gene_bins/` directories and place them in your project directory.
- Set `use_existing` to `true` in the `binning` section of the config file.
- Comment out the `laddr setup` and `laddr binning` commands in the `run.sh` script or the `laddr_binning` rule in the `Snakefile`, and do not run `laddr setup`.
If existing latent phenotype models are available for your desired species and gene annotations:

- Download the `info/`, `gene_bins/`, and `models/` directories and place them in your project directory. If existing models are used, compatible gene bins must also be used to process the coverage data.
- Set `use_existing` to `true` in both the `binning` and `fit` sections of the config file.
- Comment out the `laddr setup`, `laddr binning`, and `laddr fit` commands in the `run.sh` script or the `laddr_binning` and `laddr_fit` rules in the `Snakefile`, and do not run `laddr setup`.
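For the existing-models case, the relevant config edits might look like the fragment below. This is a sketch based only on the section and key names mentioned above; check your generated `config.yaml` for the exact layout.

```yaml
# Hypothetical config.yaml fragment; the generated config may
# nest or name these sections differently.
binning:
  use_existing: true
fit:
  use_existing: true
```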
This command prepares any data that needs to be generated once before the batch processing steps are run:

```shell
laddr setup
```

This must be run before running the Snakemake pipeline. It generates a table of gene information and a table of exon regions extracted from the gene annotation GTF file specified in the config file. It filters genes to include only those with `gene_biotype`/`gene_type` of "protein_coding" or "lncRNA", and assigns genes to batches so that the binning, coverage processing, and model fitting steps can be run in batches, for parallelization and reduced memory use.
For certain adaptive binning methods, including the default method, this step also computes a cumulative variance threshold used in each batch of the binning step to produce the desired number of bins.
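The batch assignment described above can be sketched simply. This is an illustration only, not the package's actual batching logic; the function name and batch size are hypothetical:

```python
def assign_batches(gene_ids, batch_size=500):
    """Assign genes to consecutive batches of at most batch_size genes.

    Illustrative only: the real pipeline's batching strategy and
    batch size may differ.
    """
    return {g: i // batch_size for i, g in enumerate(gene_ids)}
```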
Beware of editing the config file after running `laddr setup`: the generated info may not be compatible with the updated parameters, which can cause silent bugs.
Then, split gene regions into bins to be used for summarizing and representing coverage data:

```shell
laddr binning --batch 0
```

By default, binning is determined using coverage data, partitioning each gene in a way that aims to define more, smaller bins in areas of greater variation across samples. Any bin that overlaps an exon of another protein-coding or lncRNA gene is excluded.
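The interplay between the cumulative variance threshold (computed at setup) and adaptive binning can be illustrated with a toy sketch. This is not the package's actual algorithm; it only shows the idea that a bin closes once it has absorbed a fixed amount of per-base variance, so variable regions get more, smaller bins:

```python
import numpy as np

def variance_threshold(per_base_variance, n_bins_target):
    """Toy threshold: if each bin absorbs roughly equal variance,
    splitting at this cumulative value yields ~n_bins_target bins."""
    return float(np.sum(per_base_variance)) / n_bins_target

def adaptive_bins(per_base_variance, threshold):
    """Split positions into bins, closing a bin once its cumulative
    variance reaches the threshold. Returns (start, end) half-open bins."""
    bins, start, acc = [], 0, 0.0
    for i, v in enumerate(per_base_variance):
        acc += v
        if acc >= threshold:
            bins.append((start, i + 1))
            start, acc = i + 1, 0.0
    if start < len(per_base_variance):
        bins.append((start, len(per_base_variance)))
    return bins
```

With uniform variance this degenerates to fixed-width bins; with spiky variance, bins shrink around the spikes.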
For each gene batch, use its bin definitions to compute mean coverage per bin for all samples and normalize:

```shell
laddr coverage --dataset dset1 --batch 0
```

At this stage you can also provide quantified explicit phenotypes, e.g. from Pantry, to regress out. Training and applying models on this residualized coverage data produces "residual" latent RNA phenotypes, which can complement the explicit phenotypes by representing uncharacterized transcriptomic variation.
Normalized coverage data are loaded, one batch at a time, and used to fit a PCA (or FPCA) model for each gene:

```shell
laddr fit --batch 0
```

If a project involves multiple datasets, e.g. tissues, the coverage matrices for each dataset are loaded together, and the models are fit on the concatenated data. Genes for which too many samples (>50% by default) have zero coverage are excluded from model fitting.
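Per-gene model fitting and transformation can be sketched with a plain SVD-based PCA. This is a minimal illustration assuming only numpy; the package's actual models (PCA or FPCA) and their stored form may differ, and the function names here are hypothetical:

```python
import numpy as np

def fit_gene_model(covg, max_zero_frac=0.5):
    """Fit a PCA model for one gene's binned coverage (samples x bins).

    Returns None if more than max_zero_frac of samples have zero
    coverage across all bins, mirroring the fitting-step filter.
    """
    zero_frac = np.all(covg == 0, axis=1).mean()
    if zero_frac > max_zero_frac:
        return None
    mean = covg.mean(axis=0)
    _, _, vt = np.linalg.svd(covg - mean, full_matrices=False)
    return {"mean": mean, "components": vt}  # rows of vt are PCs

def transform_gene(model, covg, n_pcs=2):
    """Project coverage onto the first n_pcs fitted components,
    e.g. for a new dataset, so phenotypes correspond across datasets."""
    return (covg - model["mean"]) @ model["components"][:n_pcs].T
```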
Normalized coverage matrices are loaded, one batch at a time, and transformed using the fitted models:

```shell
laddr transform --dataset dset1
```

The transformed data is saved as a table of phenotypes by samples, i.e. multiple PCs per gene. If a project involves multiple datasets, the coverage matrices for each dataset can be loaded and transformed separately using the same set of models, so that the phenotypes correspond across datasets. Genes for which too many samples (>50% by default) have zero coverage in the provided dataset are excluded from the phenotypes, even if they were included in model fitting.
This command inspects the latent RNA phenotype model for a specific gene, using raw coverage, stored models, and latent phenotypes:

```shell
laddr inspect -g ENSG00000008128 -d dset1
```

The output for a PCA model is a table with one row per bin:
```
gene_id          pos   mean     std       PC1        PC2        ...  top_tenth_PC10  bottom_tenth_PC10
ENSG00000008128  -998  3.55851  0.308595  0.0338629  0.0656172  ...  3               3.36499
ENSG00000008128  -987  3.42529  0.289488  0.0031310  0.067382   ...  3.13462         3.36499
ENSG00000008128  -969  3.45428  0.237121  0.0074943  0.0712902  ...  3.15166         3.44746
```
The first four columns give the gene ID, bin center position relative to the gene TSS, and mean and standard deviation of normalized coverage used for training. (Normalized coverage values are three or higher because coverage is log2-transformed after adding a pseudocount of eight.) The next columns give the PCA loadings for each saved PC (latent phenotype). Finally, for each latent phenotype, the top and bottom 10% of samples according to their values for that phenotype are identified, and the mean (log-scaled, normalized) coverage in those samples is given. These values can be plotted to visualize the coverage patterns along the gene that each latent phenotype represents.
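The top-tenth and bottom-tenth columns can be reproduced from the latent phenotypes and the normalized coverage. Here is a sketch of that computation as described above; the function name is illustrative:

```python
import numpy as np

def extreme_sample_means(covg, pheno, frac=0.1):
    """Mean normalized coverage per bin in the top and bottom `frac`
    of samples ranked by one latent phenotype.

    covg:  samples x bins normalized coverage
    pheno: per-sample values of one latent phenotype
    Returns (top_mean, bottom_mean), each of length n_bins.
    """
    n = max(1, int(round(len(pheno) * frac)))
    order = np.argsort(pheno)
    bottom = covg[order[:n]].mean(axis=0)
    top = covg[order[-n:]].mean(axis=0)
    return top, bottom
```

Plotting `top` and `bottom` against bin positions visualizes the coverage pattern that the chosen latent phenotype represents.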
This flowchart shows how the workflow steps fit together:
```mermaid
flowchart LR
    init --> setup
    setup --> binning
    subgraph "Run with Snakemake etc."
        binning --> coverage
        coverage --> fit
        fit --> transform
        coverage --> transform
        coverage --> inspect
        fit --> inspect
        transform --> inspect
    end
```
Here is a more detailed flowchart that includes the type of input and output data for each step:
```mermaid
flowchart TD
    init --> config[/"config file (edit as needed)"/]
    config --> setup
    bigwig[/bigWig RNA-seq coverage/] --> setup
    gtf[/reference gene annotations/] --> setup
    setup --> geneinfo[/processed gene annotations/]
    setup --> binningparams[/binning parameters/]
    setup --> covgnormparams[/coverage normalization parameters/]
    config --> binning
    bigwig --> binning
    geneinfo --> binning
    binningparams --> binning
    binning --> bins[/gene bin definitions/]
    config --> coverage
    bins --> coverage
    covgnormparams --> coverage
    coverage --> covg[/binned normalized coverage/]
    config --> fit
    covg --> fit
    fit --> models[/latent phenotype models/]
    models --> transform
    covg --> transform
    transform --> phenotypes[/latent phenotypes/]
    covg --> inspect
    models --> inspect
    phenotypes --> inspect
    inspect --> modelinfo[/model info for plotting/analysis/]
    classDef step fill:#ffcfff,stroke:#990099
    classDef data fill:#d5d5f7,stroke:#2929d6
    classDef input fill:#bdecfc,stroke:#0995c3
    class init,setup,binning,coverage,fit,transform,inspect step
    class config,geneinfo,binningparams,covgnormparams,bins,covg,models,phenotypes,modelinfo data
    class gtf,bigwig input
    style phenotypes fill:#88ffcc,stroke:#009957
```
This flowchart shows templates of the names of files produced by each step under the default project structure:
```mermaid
flowchart TD
    setup --> genebininfo[/"info/genes.tsv info/exons.tsv.gz info/var_per_bin.txt"/]
    setup --> covgnormparams[/"info/scaling_factors.tsv"/]
    genebininfo --> binning
    binning --> bins[/"gene_bins/batch_{batch}.bed.gz"/]
    bins --> coverage
    covgnormparams --> coverage
    coverage --> covg[/"covg_norm/{dataset}/batch_{batch}.npy covg_norm/{dataset}/batch_{batch}.bins.tsv.gz"/]
    covg --> fit
    fit --> models[/"models/models_batch_{batch}.pickle"/]
    models --> transform
    covg --> transform
    transform --> phenotypes[/"phenotypes/latent_phenos.{dataset}.tsv.gz"/]
    covg --> inspect
    models --> inspect
    phenotypes --> inspect
    inspect --> modelinfo[/"inspect/inspect.{gene_id}.{dataset}.tsv.gz"/]
    classDef step fill:#ffcfff,stroke:#990099
    classDef data fill:#d5d5f7,stroke:#2929d6
    classDef input fill:#bdecfc,stroke:#0995c3
    class setup,binning,coverage,fit,transform,inspect step
    class genebininfo,covgnormparams,bins,covg,models,phenotypes,modelinfo data
    style phenotypes fill:#88ffcc,stroke:#009957
```