Skip to content

dedonlab/PTseq_data_analysis

Repository files navigation

Copy right and Contact

PI: pcdedon@mit.edu

Author and contact: yfyuan@mit.edu, msdemott@mit.edu

Edit log

10/04/2025 revised: add test dataset, version number

Overview

This foler contains bash and python scripts and bash commands used in the manuscripts entitled "Phosphorothioate DNA modification by BREX type 4 systems in the human gut microbiome" and "PT-seq: A method for metagenomic analysis of phosphorothioate epigenetics in complex microbial communities"

It aims to align reads of an example PT-seq dataset, identify read pileups, and extract sequences including 6 flanking nt at the pileup sites. The package utilizes a simply pipeline from trimming to mapping, sorting and sequencing analysis.

Hardware requirements

The PTseq pileup analysis pipelien requires only a standard computer with enough RAM to support the in-memory operations.

OS Requirements

This package is supported for macOS and Linux. The package has been tested on the following systems:

Linux: Centos 7 macOS: Mojave (10.14.1)

Installation and dependences

The software/tools below should be installed and added to your system’s PATH so that it can be invoked from the command line.

bbmap v35.85 https://archive.jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/installation-guide/

fastqc v0.11.8 https://github.com/s-andrews/FastQC

bowtie2 v2.4.5 https://github.com/BenLangmead/bowtie2/releases

samtools v1.19.2 https://github.com/samtools/samtools/releases/

bedtools v2.30.0 https://github.com/arq5x/bedtools2/releases

SMARTcleaner-master v1.0 https://github.com/dzhaobio/SMARTcleaner

seqkit v2.1.0 https://bioinf.shenwei.me/seqkit/

The installation time for tools above should be no more than 1 hour.

MEME-suit v5.3.3 #Please follow the instructions of the MEME suite via its website at http://meme-suite.org

Dependences for MEME-suit v5.3.3

perl v5.18.2 https://www.cpan.org/src/

ghostscript v9.52 https://ghostscript.com/releases/

automake v1.15 https://ftp.gnu.org/gnu/automake/

autoconf v2.69 https://www.gnu.org/software/autoconf/

python v3.5.2 https://www.python.org/downloads/release/python-352/

zlib v1.2.11 https://github.com/jrmwng/zlib-1.2.11

jdk v1.8.0-101 https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html

zlib v1.2.11 https://github.com/jrmwng/zlib-1.2.11

xz v5.2.3 https://github.com/tukaani-project/xz/releases

lzma v4.32.7 https://sourceforge.net/projects/lzma/

Depending on the environment of the system, the installation time for MEME-suite can be from less than 1 hour to about 3 hours.

Usage

  1. install the dependences

  2. Download the scripts and the demo dataset. Place them in the work directory

  3. Keep the demo dataset in work/demo/

  4. To deplete human sequence contamination (optional), please download the hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz file at https://zenodo.org/records/1208052 and place it in the work/demo folder with the demo reads.

  5. Modify trim.sh with ${path_to_your_bbmap}

  6. Run the scripts in the following order:

    1) trim reads

    # RAM >= 50G is required. For real PT-seq dataset, we recommond thread >= 10.

    sh trim.sh demo/demo_1.fastq demo/demo_2.fastq job_demo

    The output files for the next step are trimmed reads: job_demo_R1_final.fastq and job_demo_R2_final.fastq. The output files also include intermediate .fq files and QC report files.

    2) map reads to genome, identify pileups and extract sequences at pileup site with 6 flanking nt. input: reference genome, trimmed reads, job name

    sh main.sh demo/UYXE01.1.fsa demo/demo_1.fastq demo/demo_2.fastq demo

    The output files for the next step are fasta file of pileups before filter: dome_pileup_dep0.fasta and tab delimited .txt files: dome_pileup_dep0_F.pos.txt and demo_pileup_dep0_R.pos.txt. The columnns are:

    scaffold id, chromosome position of pileup, coverage, pileup depth, depth to coverage ratio, sequence (6 flank nt)

    3) identify conserved motifs using MEME-suit. Input file dome_pileup_dep0.fasta. Output: meme/dome_pileup_dep0.txt`

    sh meme.sh

    4) merge pileup sites.

    sh mergepileup.sh

    The output file is a tab delimited txt file with columns: scaffold, genome position, coverage, depth of read 2, depth-to-coverage ratio, sequence with 6-nt flanking, strand, motif.

    5) calculate the number of pileups.

    sh motif_stat.sh

    The output file is a txt file with motifs such as, CAG, CCA, GATC/GATC, GAAC/GTTC, and corresponding number of modified motif sites.

    6) Summarize the number of pileups per gene class.

    sh pileup_to_gffClass.sh
    sh summary_geneClass.sh

    The output file is a table of gene class (CDS, rRNA, tRNA) and corresponding number of total motif sites and modified motif sites.

About

Pipeline for PT-seq data analysis

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages