Skip to content

Reconstructing single-cell lineage history using mitochondrial mutation drift dynamics

Notifications You must be signed in to change notification settings

sankaranlab/mitodrift

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

225 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MitoDrift

pkgdown Docs

MitoDrift logo

MitoDrift reconstructs single-cell lineage trees from mitochondrial DNA (mtDNA) mutations by modeling heteroplasmy drift and measurement noise with a Wright–Fisher hidden Markov Tree (WF-HMT). It applies population genetics principles (genetic drift) to model mtDNA heteroplasmy in single cells in order to reconstruct high-precision lineage trees from single-cell genomics/multiome data. MitoDrift uses expectation-maximization (EM) to obtain maximum-likelihood estimates of drift, mutation, and error rates, then performs phylogenetic MCMC to quantify the uncertainty in tree topology. The primary output is a phylogeny with posterior clade supports and refined tree topologies at different levels of confidence. Inputs can be mtDNA allele counts from any single-cell genomics assays that capture mtDNA variation (e.g., ReDeeM, mtscATAC-seq, MAESTER).

Preprint: Teng Gao*, Chen Weng*, Isaac Johnson, Michael Poeschla, Jonas Gudera, Emily King, Christopher Rouya, Adriana Donovan, Lauren Bourke, Ying Shao, Eladio Marquez, Rahul Tyagi, Leonard I Zon, Jonathan S Weissman, Vijay G Sankaran. Modeling mitochondrial inheritance enables high-precision single-cell lineage tracing in humans. bioRxiv (2026).


Installation

# from a local checkout
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("sankaranlab/mitodrift")

For faster MCMC trace IO, install qs2 with TBB:

remotes::install_cran("qs2", type = "source", configure.args = "--with-TBB --with-simd=AVX2")

Quick start

This uses a pL1000 subset dataset packaged under inst/extdata/ (200 cells, 186 variants).

You can run this from a terminal in the package root:

Rscript inst/bin/run_mitodrift_em.R \
  --mut_dat inst/extdata/pL1000_mut_dat.csv \
  --outdir mitodrift_demo \
  --tree_mcmc_iter 5000 \
  --tree_mcmc_chains 4 \
  --tree_mcmc_burnin 1000

Outputs in mitodrift_demo/ include mitodrift_object.rds plus tree files and diagnostics.

Once the run finishes:

library(mitodrift)
mut_dat <- read.csv(
  system.file("extdata", "pL1000_mut_dat.csv", package = "mitodrift")
)
md <- readRDS("mitodrift_demo/mitodrift_object.rds")
phy_trim <- trim_tree(md$tree_annot, conf = 0.5)

plot_phylo_heatmap2(
  phy_trim,
  mut_dat,
  dot_size = 1,
  branch_width = 0.3,
  branch_length = FALSE,
  node_conf = TRUE,
  het_max = 1
)

Input data requirements

Long format (mut_dat)

One row per cell–variant pair, with counts for alternate allele (a) and total depth (d).

cell variant a d
PD45534aj MT_10448_T_C 0 2348
PD45534aj MT_11787_T_C 1462 2000
PD45534aj MT_1244_T_C 2 1500
PD45534ak MT_10448_T_C 5 2100

IMPORTANT: Include rows where a = 0 so that every cell × variant combination is represented (the observation model uses total depth).

Matrix pair format (amat, dmat)

Wide matrices with variant in the first column and cells in subsequent columns. Use the same layout for amat (alt counts) and dmat (total depth).

Inference settings & diagnostics

Full lineage inference pipeline is run with inst/bin/run_mitodrift_em.R.

Model parameters (EM fitting)

  • --fit_params: Enable automatic parameter fitting via EM (default: TRUE)
  • --fit_param_max_iter: Maximum EM iterations (default: 100)
  • --fit_param_epsilon: EM convergence threshold on parameter changes (default: 1e-3)
  • --k: Number of VAF bins for discretizing heteroplasmy levels (default: 20)
  • --npop: Population size for Wright-Fisher model (default: 600)
  • --eps: Mutation rate per branch (default: 0.001; auto-fitted if fit_params=TRUE)
  • --err: Variant detection error rate (default: 0; auto-fitted if fit_params=TRUE)
  • --ngen: Number of WF generations (default: 100; auto-fitted if fit_params=TRUE)

Phylogenetic MCMC settings

  • --tree_mcmc_chains: Number of independent MCMC runs (default: 1; recommended 10-50 for robust inference. For large trees e.g., >8000 cells, use less chains to avoid memory limits)
  • --tree_mcmc_iter: Maximum iterations per chain (default: 100; can be overridden by automatic termination via ASDSF convergence check)
  • --conv_thres: ASDSF threshold for MCMC termination (default: NULL; e.g., 0.05-0.1 for auto-convergence). ASDSF (Average Standard Deviation of Split Frequencies) summarizes topology agreement across chains. A lower value indicates good mixing. Values < 0.05 indicate good convergence; < 0.1 is acceptable for exploratory analyses or large trees.
  • --tree_mcmc_batch_size: Iteration interval for saving traces and checking convergence (default: 1000)
  • --tree_mcmc_burnin: Initial samples to discard from each chain (default: 0; recommended 10-20% of iterations)

Parallelization options

  • --ncores: Default number of cores for likelihood computation during tree optimization (default: 1)
  • --ncores_em: Cores for parallel EM parameter fitting across variants (default: 1)
  • --ncores_nj: Cores for initial NJ tree distance matrix computation via parallelDist (default: 1)
    • NOTE: Different ncores_nj values may yield slightly different NJ topologies due to floating-point rounding
  • --ncores_qs: Cores for parallel file I/O (qs2 compression) during MCMC and annotation (default: 1)
  • --ncores_annot: Cores for computing clade frequencies from MCMC traces (default: same as --ncores)

Resuming interrupted job

  • Use --resume TRUE to continue interrupted runs from saved traces

Analysis workflow

For a complete walkthrough, see the Analysis Workflow tutorial.

About

Reconstructing single-cell lineage history using mitochondrial mutation drift dynamics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors