23andMe/LDZip

LDZipMatrix

LDZipMatrix is a suite of tools for compressing and randomly accessing large Linkage Disequilibrium (LD) matrices.

It is designed for workflows where LD matrices are too large to store uncompressed but still need fast, targeted access. Data are stored as flat files, so no database server is required and deployment stays simple and portable. Multiple LD metrics are supported (e.g., phased/unphased r and r-squared, delta, and Dprime). Common use cases include:

  • retrieving individual LD values between variant pairs (e.g., A vs. B)
  • identifying variants in high LD with a given variant (above a specified threshold)
  • extracting LD submatrices for downstream analyses (e.g., SuSiE, fine-mapping)
  • generating inputs for LocusZoom plots, variant annotation, and related workflows

This repository includes three main components:

  • C++ binary (ldzip) - Compresses plink2 LD matrices into the .ldzip format and supports related operations such as decompression, filtering, and concatenation across chromosomes.

  • R package (LDZipMatrix) - Opens and queries .ldzip files efficiently from R with random access.

  • Nextflow pipeline - Automates whole-genome .ldzip generation (including LD calculation using plink2) by running jobs on small chunks and combining the outputs.


Installation

C++ Binary

  • The snippet below compiles the ldzip C++ binary and places it in cpp/bin/ldzip.
  • You need the ldzip binary only to create or convert .ldzip files (compressing PLINK LD matrices, plus decompression, filtering, and concatenation).
  • To read existing compressed data, install the R package LDZipMatrix instead.
  • For more details on usage of the C++ binary, please see the C++ documentation.
git clone git@github.com:23andMe/LDZip.git
cd LDZip/
make cpp
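
After the build finishes, a quick check can confirm that the binary landed at cpp/bin/ldzip as described above. The sketch below assumes it is run from the LDZip/ checkout; check_ldzip_binary is a hypothetical helper written for this example, not something the repository ships.

```shell
# Hypothetical helper (not part of the repository): report whether an
# executable file exists at the given path (default: cpp/bin/ldzip).
check_ldzip_binary() {
    bin_path="${1:-cpp/bin/ldzip}"
    if [ -x "$bin_path" ] && [ ! -d "$bin_path" ]; then
        echo "found: $bin_path"
    else
        echo "missing: $bin_path (re-run 'make cpp' and check the compiler output)" >&2
        return 1
    fi
}

# Run from the LDZip/ directory after 'make cpp'.
check_ldzip_binary cpp/bin/ldzip || echo "build output not found"
```

If the check fails, the compiler messages printed by make cpp are usually the first place to look.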

R Package

  • The snippet below builds and installs the R package LDZipMatrix.
  • This package is required for random access to compressed matrices in R.
  • You do not need to build the C++ binary to use the R package.
  • Ensure that roxygen2 is installed for documentation and NAMESPACE generation.
  • For more details on the R package, please see the R documentation.
git clone git@github.com:23andMe/LDZip.git
cd LDZip/
make r-package
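
Because the build relies on roxygen2, it can help to confirm the package is available before running make r-package. The following is a minimal sketch, assuming Rscript is on your PATH; ensure_roxygen2 is a hypothetical helper, and the CRAN mirror URL is an assumption you may want to change.

```shell
# Hypothetical pre-flight check; not part of the repository's Makefile.
# Installs roxygen2 from CRAN only if R cannot already load it.
ensure_roxygen2() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH" >&2
        return 1
    fi
    Rscript -e 'if (!requireNamespace("roxygen2", quietly = TRUE)) install.packages("roxygen2", repos = "https://cloud.r-project.org")'
}

# Typical use, from the LDZip/ directory:
#   ensure_roxygen2 && make r-package
```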

Nextflow

The Nextflow pipeline automates creation of a whole-genome compressed LD archive by scattering work across chunks and concatenating the resulting outputs.


FAQ

  • I already have a .ldzip file and want to query it. What should I do?
    Install the R package and use the R API to fetch LD values and neighboring linked variants. Go to: R Package

  • I have a PLINK LD matrix and want to create a .ldzip file. What should I do?
    Build the C++ ldzip binary and run the compress command. Go to: C++ Binary

  • I have PLINK pgen files and want to build whole-genome .ldzip outputs in a pipeline. What should I do?
    Use the Nextflow workflow. Go to: Nextflow

  • I already have a .ldzip file and want to convert it back to my own format. What should I do?
    Build the C++ ldzip binary and run the decompress command. Go to: C++ Binary


Getting Help / Support

If you find a bug or have a feature request, please open a GitHub Issue in this repository.

When reporting an issue, it is helpful to include:

  • what you were trying to do
  • the command or R code you ran
  • your OS and compiler / R versions
  • a minimal reproducible example, if possible
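
One way to gather the OS and compiler / R versions for a report is sketched below. It assumes g++ is your compiler (swap in clang++ or another compiler as needed); report_version is a hypothetical helper for this example.

```shell
# Print the first line of a tool's --version output, or note that the
# tool is not installed instead of failing.
report_version() {
    if command -v "$1" >/dev/null 2>&1; then
        "$1" --version 2>&1 | head -n 1
    else
        echo "$1: not installed"
    fi
}

uname -srm            # OS name, release, and architecture
report_version g++    # assumption: replace with the compiler you used
report_version R
```

Pasting this output into the issue alongside the command or R code you ran covers most of the checklist above.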

Security / Disclaimer

This tool is intended for trusted workflows and assumes that input .ldzip files are well-formed and generated by trusted sources. Do not run this tool on untrusted or user-supplied .ldzip files. The parser is optimized for performance and does not perform full defensive validation against maliciously crafted inputs.


Contact

For questions or issues related to LDZipMatrix, please use the GitHub issue tracker or email:
sayantand@23andme.com
