PI: pcdedon@mit.edu
Author and contact: yfyuan@mit.edu, msdemott@mit.edu
10/04/2025 revised: add test dataset, version number
This foler contains bash and python scripts and bash commands used in the manuscripts entitled "Phosphorothioate DNA modification by BREX type 4 systems in the human gut microbiome" and "PT-seq: A method for metagenomic analysis of phosphorothioate epigenetics in complex microbial communities"
It aims to align reads of an example PT-seq dataset, identify read pileups, and extract sequences including 6 flanking nt at the pileup sites. The package utilizes a simply pipeline from trimming to mapping, sorting and sequencing analysis.
The PTseq pileup analysis pipelien requires only a standard computer with enough RAM to support the in-memory operations.
This package is supported for macOS and Linux. The package has been tested on the following systems:
Linux: Centos 7 macOS: Mojave (10.14.1)
The software/tools below should be installed and added to your system’s PATH so that it can be invoked from the command line.
bbmap v35.85 https://archive.jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/installation-guide/
fastqc v0.11.8 https://github.com/s-andrews/FastQC
bowtie2 v2.4.5 https://github.com/BenLangmead/bowtie2/releases
samtools v1.19.2 https://github.com/samtools/samtools/releases/
bedtools v2.30.0 https://github.com/arq5x/bedtools2/releases
SMARTcleaner-master v1.0 https://github.com/dzhaobio/SMARTcleaner
seqkit v2.1.0 https://bioinf.shenwei.me/seqkit/
MEME-suit v5.3.3 #Please follow the instructions of the MEME suite via its website at http://meme-suite.org
Dependences for MEME-suit v5.3.3
perl v5.18.2 https://www.cpan.org/src/
ghostscript v9.52 https://ghostscript.com/releases/
automake v1.15 https://ftp.gnu.org/gnu/automake/
autoconf v2.69 https://www.gnu.org/software/autoconf/
python v3.5.2 https://www.python.org/downloads/release/python-352/
zlib v1.2.11 https://github.com/jrmwng/zlib-1.2.11
jdk v1.8.0-101 https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html
zlib v1.2.11 https://github.com/jrmwng/zlib-1.2.11
xz v5.2.3 https://github.com/tukaani-project/xz/releases
lzma v4.32.7 https://sourceforge.net/projects/lzma/
Depending on the environment of the system, the installation time for MEME-suite can be from less than 1 hour to about 3 hours.
-
install the dependences
-
Download the scripts and the demo dataset. Place them in the work directory
-
Keep the demo dataset in work/demo/
-
To deplete human sequence contamination (optional), please download the hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz file at https://zenodo.org/records/1208052 and place it in the work/demo folder with the demo reads.
-
Modify trim.sh with ${path_to_your_bbmap}
-
Run the scripts in the following order:
1) trim reads
# RAM >= 50G is required. For real PT-seq dataset, we recommond thread >= 10.
sh trim.sh demo/demo_1.fastq demo/demo_2.fastq job_demoThe output files for the next step are trimmed reads:
job_demo_R1_final.fastqandjob_demo_R2_final.fastq. The output files also include intermediate .fq files and QC report files.2) map reads to genome, identify pileups and extract sequences at pileup site with 6 flanking nt. input: reference genome, trimmed reads, job name
sh main.sh demo/UYXE01.1.fsa demo/demo_1.fastq demo/demo_2.fastq demoThe output files for the next step are fasta file of pileups before filter:
dome_pileup_dep0.fastaand tab delimited .txt files:dome_pileup_dep0_F.pos.txtanddemo_pileup_dep0_R.pos.txt. The columnns are:scaffold id, chromosome position of pileup, coverage, pileup depth, depth to coverage ratio, sequence (6 flank nt)
3) identify conserved motifs using MEME-suit. Input file
dome_pileup_dep0.fasta. Output: meme/dome_pileup_dep0.txt`sh meme.sh4) merge pileup sites.
sh mergepileup.shThe output file is a tab delimited txt file with columns: scaffold, genome position, coverage, depth of read 2, depth-to-coverage ratio, sequence with 6-nt flanking, strand, motif.
5) calculate the number of pileups.
sh motif_stat.shThe output file is a txt file with motifs such as, CAG, CCA, GATC/GATC, GAAC/GTTC, and corresponding number of modified motif sites.
6) Summarize the number of pileups per gene class.
sh pileup_to_gffClass.shsh summary_geneClass.shThe output file is a table of gene class (CDS, rRNA, tRNA) and corresponding number of total motif sites and modified motif sites.