🧠 PathBench: Evaluating Vision and Pathology Foundation Models for Computational Pathology

A Comprehensive Benchmark Study

👥 Authors

Rohan Bareja¹, Francisco Carrillo-Perez¹, Yuanning Zheng¹, Marija Pizurica¹,
Tarak Nath Nandi², Lu Tian³, Jeanne Shen⁴, Ravi Madduri², Olivier Gevaert¹

¹Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine
²Data Science and Learning Division, Argonne National Laboratory
³Department of Biomedical Data Science, Stanford University, School of Medicine, Stanford, CA, USA
⁴Department of Pathology, Stanford University, School of Medicine

📝 Abstract

To advance precision medicine in pathology, robust AI-driven foundation models are increasingly needed to uncover complex patterns in large-scale pathology datasets, enabling more accurate disease detection, classification, and prognostic insights. However, despite substantial progress in deep learning and computer vision, the comparative performance and generalizability of these pathology foundation models across diverse histopathological datasets and tasks remain largely unexamined. In this study, we benchmark 32 AI foundation models for computational pathology across 4 model categories: general vision models (VM), general vision-language models (VLM), pathology-specific vision models (Path-VM), and pathology-specific vision-language models (Path-VLM). The models are evaluated on 41 slide-level and patch-level tasks sourced from TCGA, CPTAC, external benchmarking datasets, and out-of-domain datasets. Across TCGA tasks, several pathology-specific vision models consistently ranked among the top performers, including Virchow2, Prov-GigaPath, H-optimus-0, and UNI. Extending evaluation to CPTAC and out-of-domain datasets revealed more nuanced generalization behavior, with relative model rankings exhibiting modest but consistent shifts across datasets and task categories. Pairwise statistical comparisons indicated that performance differences among the top models were often small and task dependent, highlighting broadly comparable performance rather than a single dominant model. We also show that Path-VM outperformed Path-VLM and performed competitively with VM, securing top rankings across tasks despite lacking a statistically significant edge over VM. Analyses of model and dataset scaling showed that increasing model size or pretraining dataset size did not consistently translate into improved downstream performance, particularly outside TCGA benchmarks.
Finally, we demonstrate that model ensembling using a late fusion approach (fusion model), which combines predictions from multiple top-performing foundation models, yields improved aggregate performance across external datasets and tissue types, underscoring the complementary strengths learned by different models. Together, these results emphasize that generalization in computational pathology is heterogeneous and task dependent, and that factors beyond scale alone likely contribute to downstream performance, motivating further work to better understand and improve robustness across diverse tissues, datasets, and clinical settings.

🔬 Overview

This repository accompanies our paper:

Evaluating Vision and Pathology Foundation Models for Computational Pathology: A Comprehensive Benchmark Study
medRxiv preprint, May 2025

We benchmark 32 foundation models across 41 computational pathology tasks, including:

General-purpose Vision Models (VM)
Vision-Language Models (VLM)
Pathology-specific Vision Models (Path-VM)
Pathology-specific Vision-Language Models (Path-VLM)

We evaluate performance across data from TCGA, CPTAC, and several external out-of-domain datasets. Tasks include tumor classification, molecular subtyping, tumor stage, and pathway prediction.

📊 PathBench

You can explore the complete benchmark results interactively via our web portal:

PathBench

📈 Key Findings

Several pathology-specific vision foundation models consistently ranked among the top performers, including Virchow2, Prov-GigaPath, H-optimus-0, UNI, and UNI2, across a large benchmark of 41 pathology tasks spanning TCGA, CPTAC, external benchmarks, and out-of-domain datasets.
Pathology-specific vision models (Path-VM) showed a clear advantage over pathology-specific vision-language models (Path-VLM), while demonstrating performance comparable to general vision models (VM), with differences across model families varying by task and dataset.
Model size and reported pretraining dataset size alone did not consistently explain downstream performance, suggesting that additional factors such as training data composition, tissue diversity, and architectural design likely influence model generalization.
Explicit separation of TCGA and non-TCGA evaluations revealed heterogeneous generalization behavior across datasets, highlighting the importance of evaluating pathology foundation models beyond in-domain benchmarks.
Model ensembling using a late fusion strategy improved aggregate performance across tasks and datasets, indicating that different foundation models capture complementary visual representations.
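
As a rough illustration of the late fusion idea in the last finding, the sketch below averages per-class probabilities from several models and takes the most probable class. It is a minimal sketch, not the fusion model used in the paper: the function names `late_fusion` and `predict`, the unweighted mean, the placeholder model names, and the toy probabilities are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def late_fusion(per_model_probs: Dict[str, Sequence[Sequence[float]]]) -> List[List[float]]:
    """Average class probabilities across models, sample by sample.

    per_model_probs maps a model name to a list of per-sample probability
    vectors. All models must score the same samples in the same order.
    """
    models = list(per_model_probs)
    if not models:
        raise ValueError("need predictions from at least one model")
    n_samples = len(per_model_probs[models[0]])
    if any(len(per_model_probs[m]) != n_samples for m in models):
        raise ValueError("all models must score the same number of samples")

    fused = []
    for i in range(n_samples):
        rows = [per_model_probs[m][i] for m in models]
        n_classes = len(rows[0])
        # Unweighted mean of each class probability (an assumed fusion rule).
        fused.append([sum(r[c] for r in rows) / len(rows) for c in range(n_classes)])
    return fused


def predict(fused: Sequence[Sequence[float]]) -> List[int]:
    """Return the index of the highest fused probability for each sample."""
    return [max(range(len(p)), key=p.__getitem__) for p in fused]


# Toy probabilities for two samples and two classes (not benchmark results).
toy = {
    "model_a": [[0.9, 0.1], [0.4, 0.6]],
    "model_b": [[0.7, 0.3], [0.6, 0.4]],
    "model_c": [[0.8, 0.2], [0.2, 0.8]],
}
print(predict(late_fusion(toy)))  # [0, 1]
```

In the second toy sample, model_b alone would predict class 0, but the averaged probabilities favor class 1. This is the kind of complementary behavior that fusion is meant to exploit.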

📁 Repository Structure

.
├── dashboard/              # Dashboard code (e.g., Streamlit app)
├── data/                   # Data used for the dashboard (summaries, plots, results)
├── environments/           # Conda environment YAML files
│   └── linear_eval.yml     # Recommended environment for model evaluation
├── models/                 # Vision transformer model code
├── scripts/                # Linear evaluation scripts for benchmarking
├── src/                    # Patch extraction scripts (e.g., patch_gen_hdf5.py)
└── README.md               # Project overview and setup instructions

🚀 Getting Started

⚙️ System Requirements

Operating system(s) tested: Linux (tested on SUSE Linux Enterprise Server 15 SP6; expected to run on other modern Linux distributions such as Ubuntu 22.04 or CentOS 7)
Dependencies: fully specified in environments/linear_eval.yml
Hardware: standard x86_64 CPU; GPU recommended for faster model evaluation
Typical install time for conda environment on a "normal" desktop computer: ~10–15 minutes

Clone the repository

git clone https://github.com/gevaertlab/benchmarking-path-models.git
cd benchmarking-path-models

Set up the Conda environment

We recommend using the provided Conda environment for reproducibility:

conda env create -f environments/linear_eval.yml
conda activate linear_eval

Patch extraction

To extract patches from whole-slide images (WSIs), use the script src/patch_gen_hdf5.py. An example script that runs the patch extraction is provided at src/submit_patch_gen_hdf5.sh.

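
The tiling step behind patch extraction can be sketched as follows. This is not the logic of src/patch_gen_hdf5.py, which may also filter background, choose a magnification level, or write HDF5 output. The sketch only enumerates the top-left corners of 256x256 patches that fit inside a slide of a given size. The function name `patch_origins` and its `stride` parameter are hypothetical.

```python
from typing import Iterator, Tuple


def patch_origins(width: int, height: int, patch: int = 256,
                  stride: int = 256) -> Iterator[Tuple[int, int]]:
    """Yield (x, y) top-left corners of patches that fit fully inside the slide.

    With stride == patch the patches do not overlap. Partial patches at the
    right and bottom edges are skipped (an assumed policy).
    """
    if patch <= 0 or stride <= 0:
        raise ValueError("patch and stride must be positive")
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            yield (x, y)


# A 1000x600 pixel region holds 3 columns and 2 rows of 256x256 patches.
origins = list(patch_origins(1000, 600))
print(len(origins))  # 6
```

A region smaller than one patch yields no coordinates, so callers do not need a separate size check.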
Example: Run evaluation script for UNI

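# Set the shell variables $arch, $p_size, $p_weights, and $out_dir before running,
# and replace /home/rbareja/dino/eval_linear_uni.py with the path to your copy of the script.
# The output directory must already exist, or the log redirect on the last line will fail.
python -m torch.distributed.launch \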
  --master_port $RANDOM \
  --nproc_per_node=4 \
  /home/rbareja/dino/eval_linear_uni.py \
  --patch_data_path _Patches256x256_hdf5/ \
  --train_csv_path ../tcga_cancer_metadata/brain_meta/tcga_ref_brain_IDHmut_train_fold0.csv \
  --val_csv_path ../tcga_cancer_metadata/brain_meta/tcga_ref_brain_IDHmut_val_fold0.csv \
  --test_csv_path ../tcga_cancer_metadata/brain_meta/tcga_ref_brain_IDHmut_test.csv \
  --no_aug \
  --img_size=256 \
  --max_patches_total=500 \
  --bag_size=50 \
  --test_max_patches_total=500 \
  --test_bag_size=500 \
  --output_dir ../eval_brain/IDHmut_classification/"$out_dir"/ \
  --train_from_scratch no \
  --num_workers=2 \
  --batch_size_per_gpu 16 \
  --test_batch_size_per_gpu 2 \
  --num_labels 2 \
  --arch "$arch" \
  --patch_size="$p_size" \
  --epochs 30 \
  --evaluate \
  --pretrained_weights "$p_weights" \
  > ../eval_brain/IDHmut_classification/"$out_dir"/logtesdata.txt
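
The flags --max_patches_total and --bag_size in the command above suggest that a capped number of patches per slide is grouped into fixed-size bags. The sketch below shows one plausible reading of those two flags. It is an assumption about their meaning, not a description of what eval_linear_uni.py does. The function `make_bags` and its `seed` argument are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_bags(patches: Sequence[T], max_patches_total: int = 500,
              bag_size: int = 50, seed: int = 0) -> List[List[T]]:
    """Randomly keep at most max_patches_total patches, then split them into bags.

    The last bag may be smaller than bag_size when the number of kept
    patches is not a multiple of bag_size.
    """
    if max_patches_total < 0 or bag_size <= 0:
        raise ValueError("max_patches_total must be >= 0 and bag_size > 0")
    rng = random.Random(seed)  # fixed seed so the sampling is repeatable
    n = min(len(patches), max_patches_total)
    chosen = rng.sample(list(patches), n)
    return [chosen[i:i + bag_size] for i in range(0, n, bag_size)]


# Training-style settings from the command: 500 patches in bags of 50.
train_bags = make_bags(range(1200), max_patches_total=500, bag_size=50)
print(len(train_bags), len(train_bags[0]))  # 10 50

# Test-style settings from the command: one bag holding up to 500 patches.
test_bags = make_bags(range(1200), max_patches_total=500, bag_size=500)
print(len(test_bags), len(test_bags[0]))  # 1 500
```

Under this reading, the training flags produce many small bags per slide, while the test flags score each slide as a single large bag.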

💻 Computational Requirements

Model inference time depends on the model, task, and dataset size.
Typically, evaluation of a single model takes a couple of hours on a standard desktop with 4 GPUs or a moderately-sized CPU cluster.
Scripts support parallel evaluation via PyTorch Distributed for multi-GPU setups.
Exact runtime may vary depending on hardware, batch size, and data preprocessing.

📖 Citation

If you use this work in your research, please cite our preprint:

Bareja R, Carrillo-Perez F, Zheng Y, Pizurica M, Nandi TN, Shen J, Madduri R, Gevaert O. Evaluating Vision and Pathology Foundation Models for Computational Pathology: A Comprehensive Benchmark Study. medRxiv, 2025. https://doi.org/10.1101/2025.05.08.25327250

📄 License

This project is licensed under the MIT License.
