omicsEye/massSight

massSight

massSight implements probabilistic drift-aware optimal transport for cross-study LC–MS feature matching.

Install

Install using uv:

uv add mass-sight

Or using pip:

pip install mass-sight

Core dependencies include numpy, pandas, scikit-learn, and scipy.

Quickstart

Input feature tables

massSight expects two feature tables, one per study, each with the following columns (Intensity is optional):

  • MZ (m/z)
  • RT (retention time in minutes)
  • Intensity (a per-feature intensity summary; optional)
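
As a rough illustration of the expected layout, the sketch below builds a toy table with the canonical column names and checks that the two required columns are present. The numeric values are made up, and the check is a plain pandas idiom, not part of massSight.

```python
import pandas as pd

# Toy feature table using the canonical column names (values are illustrative only).
study_a = pd.DataFrame(
    {
        "MZ": [180.0634, 255.2324, 304.2402],
        "RT": [1.25, 7.80, 9.42],  # retention time in minutes
        "Intensity": [1.2e6, 3.4e5, 8.9e4],  # optional per-feature summary
    }
)

# Simple sanity check before matching: MZ and RT must be present.
required = {"MZ", "RT"}
missing = required - set(study_a.columns)
if missing:
    raise ValueError(f"missing required columns: {sorted(missing)}")
```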

MassSightConfig and column names

massSight is configured via MassSightConfig. By default it expects canonical column names (MZ, RT, Intensity).

If your tables use different column names, either rename your columns or pass a schema override:

from mass_sight import MassSightConfig

cfg = MassSightConfig(mz_col="mz", rt_col="rt_min", intensity_col="area")
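
If you would rather rename columns than pass an override, a minimal pandas sketch could look like this. The source column names (mz, rt_min, area) mirror the override example above, and the values are placeholders.

```python
import pandas as pd

# Placeholder table with non-canonical column names.
raw = pd.DataFrame({"mz": [180.0634], "rt_min": [1.25], "area": [1.2e6]})

# Map the existing names onto the canonical ones massSight expects by default.
canonical = raw.rename(columns={"mz": "MZ", "rt_min": "RT", "area": "Intensity"})
```

After renaming, the table can be used with a default MassSightConfig().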

If study_a and study_b use different schemas, use study-specific overrides:

cfg = MassSightConfig(
    mz_col_study_a="mz",
    rt_col_study_a="rt_min",
    mz_col_study_b="m_over_z",
    rt_col_study_b="rt",
)

For the CLI, use --mz-col, --rt-col, and --intensity-col.

Python usage

import pandas as pd
from mass_sight import MassSightConfig, match_features

study_a = pd.read_csv("study_a.csv")
study_b = pd.read_csv("study_b.csv")

cfg = MassSightConfig()
res = match_features(study_a, study_b, cfg)

top1 = res.top1  # id1, id2, decision, margin, prob_raw, ...
candidates = res.candidates  # residuals, log-likelihoods, OT weights, etc.
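
Since the comment above suggests top1 is a table, one way to post-process it is ordinary pandas filtering. The sketch below uses a stand-in DataFrame with some of the column names from that comment. The values, the 0.9 cutoff, and the assumption that a larger margin means a clearer match are all illustrative, not documented massSight behavior.

```python
import pandas as pd

# Stand-in for res.top1; column names follow the comment above, values are made up.
top1 = pd.DataFrame(
    {
        "id1": ["a1", "a2", "a3"],
        "id2": ["b7", "b2", "b9"],
        "margin": [0.40, 0.05, 0.75],
        "prob_raw": [0.95, 0.55, 0.99],
    }
)

# Keep pairs above an arbitrary probability cutoff, largest margin first.
confident = top1[top1["prob_raw"] >= 0.9].sort_values("margin", ascending=False)
print(confident[["id1", "id2", "prob_raw"]])
```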

Command-line usage

mass_sight match study_a.csv study_b.csv \
  --out-candidates candidates.csv \
  --out-top1 top1.csv

For public-data reuse from Metabolomics Workbench:

mass_sight find --out selection.json

mass_sight reuse \
  --analysis-id AN000001 \
  --analysis-id AN000002 \
  --out-dir reuse_out

find launches an interactive terminal UI to browse Workbench by disease and export a reproducible selection manifest of analysis_ids. You can then pass those IDs to reuse (or script against selection.json).
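
Scripting against selection.json might look like the sketch below, which turns a manifest into the argument list for mass_sight reuse. The manifest layout is an assumption (a JSON object with a top-level "analysis_ids" list); inspect your own selection.json before relying on it. The helper name build_reuse_command is hypothetical.

```python
import json
from pathlib import Path


def build_reuse_command(selection_path, out_dir="reuse_out"):
    """Build a `mass_sight reuse` argument list from a selection manifest.

    ASSUMPTION: the manifest is a JSON object with a top-level
    "analysis_ids" list; the real layout may differ.
    """
    manifest = json.loads(Path(selection_path).read_text())
    cmd = ["mass_sight", "reuse"]
    # The CLI takes one --analysis-id flag per analysis.
    for analysis_id in manifest["analysis_ids"]:
        cmd += ["--analysis-id", analysis_id]
    cmd += ["--out-dir", out_dir]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the run.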

The reuse workflow is designed for end users who want a quick, reproducible cross-study run from public Metabolomics Workbench (MW) analysis IDs. It fetches mwtab + untarg_data, summarizes study metadata, automatically groups compatible assays (same ion mode and chromatography), and runs clustering within each group.

Common options:

  • --use-intensity off|auto|on: control whether intensity is used across studies (default off).
  • --min-group-size N: require at least N studies per assay-compatible group (default 2).
  • --allow-unknown-strata: include analyses with unknown ion mode/chromatography instead of dropping them.
  • --fetch-targeted-data: also fetch study-level named-metabolite matrices (/study_id/.../data) for targeted/meta-analysis (default disabled).
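
To make the grouping options concrete, here is a pandas sketch of how analyses could be stratified by ion mode and chromatography. It illustrates the idea behind --min-group-size and --allow-unknown-strata only. It is not massSight's implementation, and the metadata table, its column names, and the group_compatible helper are hypothetical.

```python
import pandas as pd

# Hypothetical per-analysis metadata; None marks an unknown value.
meta = pd.DataFrame(
    {
        "analysis_id": ["AN000001", "AN000002", "AN000003", "AN000004", "AN000005"],
        "ion_mode": ["positive", "positive", "negative", "positive", None],
        "chromatography": ["HILIC", "HILIC", "HILIC", "reversed phase", "HILIC"],
    }
)


def group_compatible(meta, min_group_size=2, allow_unknown_strata=False):
    """Group analyses sharing ion mode and chromatography; drop small groups."""
    strata = ["ion_mode", "chromatography"]
    if allow_unknown_strata:
        # Treat unknown values as their own stratum instead of dropping them.
        meta = meta.fillna({col: "unknown" for col in strata})
    else:
        meta = meta.dropna(subset=strata)
    groups = meta.groupby(strata)["analysis_id"].agg(list)
    return {key: ids for key, ids in groups.items() if len(ids) >= min_group_size}


groups = group_compatible(meta)
```

With the default settings, only the positive-mode HILIC pair forms a group in this toy table; the other analyses are either alone in their stratum or have an unknown ion mode.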

By default, CLI runs also write a machine-readable run manifest capturing software version, parameters, runtime, and outputs:

  • match: <out-candidates-stem>.run_manifest.json
  • cluster: <out-dir>/run_manifest.json
  • reuse: <out-dir>/run_manifest.json
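
For scripts that need to locate the manifest of a match run, the naming pattern above can be reproduced with pathlib. This sketch assumes the manifest is written next to the candidates file, which the pattern suggests but does not state. The helper name match_manifest_path is hypothetical.

```python
from pathlib import Path


def match_manifest_path(out_candidates):
    """Expected run-manifest path for a `match` run, per the naming pattern above.

    ASSUMPTION: the manifest sits in the same directory as the candidates file.
    """
    out_candidates = Path(out_candidates)
    # <out-candidates-stem>.run_manifest.json
    return out_candidates.with_name(f"{out_candidates.stem}.run_manifest.json")


print(match_manifest_path("results/candidates.csv"))
```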

Citation

  • See CITATION.cff.

License

MIT.
