snakemake workflow comparing Genrich and MACS2

[workflow diagram of the pipeline]

Dependencies

git clone the repo (on July 22nd) and build:

git clone https://github.com/jsh58/Genrich
cd Genrich
make
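
make produces a Genrich binary inside the cloned directory. The Snakefile calls it through the Genrich_path key in config.yaml, so point that key at your build (the path below is an example; adjust it to your system):

## in config.yaml
Genrich_path: "/path/to/Genrich/Genrich"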

macs2 2.1.2

conda install -c bioconda macs2=2.1.2
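
macs2 is distributed through the bioconda channel, and the peak-calling rules below run source activate macs2, so one convenient setup (a suggestion, not prescribed by the repo) is a dedicated conda env of that name:

conda create -n macs2 -c bioconda macs2=2.1.2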

How to run it

ssh bio1
## start a screen session
screen
git clone https://github.com/crazyhottommy/Genrich_compare
## make samples.json file
python sample2json.py /path/to/fastq/dir meta.txt

The meta.txt file should be a four-column, tab-delimited file.

e.g. for ATACseq data:

cat meta.txt
sample fastq factor read
SRR891269 SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891269 SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891270 SRR891270_GSM1155959_GM12878_ATACseq_50k_Rep3_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891270 SRR891270_GSM1155959_GM12878_ATACseq_50k_Rep3_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891271 SRR891271_GSM1155960_GM12878_ATACseq_50k_Rep4_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891271 SRR891271_GSM1155960_GM12878_ATACseq_50k_Rep4_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891272 SRR891272_GSM1155961_GM12878_ATACseq_500_Rep1_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891272 SRR891272_GSM1155961_GM12878_ATACseq_500_Rep1_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891273 SRR891273_GSM1155962_GM12878_ATACseq_500_Rep2_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891273 SRR891273_GSM1155962_GM12878_ATACseq_500_Rep2_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891274 SRR891274_GSM1155963_GM12878_ATACseq_500_Rep3_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891274 SRR891274_GSM1155963_GM12878_ATACseq_500_Rep3_Homo_sapiens_OTHER_2.fastq.gz atac R2

for ChIPseq data with single-end reads:

sample fastq_name factor reads
C57BL SRR534669.fastq.gz Per2 R1
C57BL SRR534670.fastq.gz Per2 R1
C57BL SRR534671.fastq.gz Per2 R1
C57BL SRR534672.fastq.gz Per2 R1
C57BL SRR534673.fastq.gz Per2 R1
C57BL SRR534674.fastq.gz Per2 R1
C57BL SRR534676.fastq.gz Input R1

for ChIPseq data with paired-end reads:

sample fastq_name factor reads
LKR10 Input_LKR10_1.fq.gz Input R1
LKR10 Input_LKR10_2.fq.gz Input R2
V6_5 Input_V6_5_1.fq.gz Input R1
V6_5 Input_V6_5_2.fq.gz Input R2
LKR10 Mll4_LKR10_1.fq.gz Mll4 R1
LKR10 Mll4_LKR10_2.fq.gz Mll4 R2
LKR10 Per2_A_LKR10_1.fq.gz Per2A R1
LKR10 Per2_A_LKR10_2.fq.gz Per2A R2
LKR10 Per2_B_LKR10_1.fq.gz Per2B R1
LKR10 Per2_B_LKR10_2.fq.gz Per2B R2
V6_5 Per2_B_V6_5_1.fq.gz Per2 R1
V6_5 Per2_B_V6_5_2.fq.gz Per2 R2
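
sample2json.py writes the samples.json that the Snakefile reads. The exact schema is defined by the script itself; assuming the ATACseq meta.txt above, an entry might look roughly like this sketch (illustrative only, not the guaranteed format):

{
    "SRR891269": {
        "atac": {
            "R1": ["SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_1.fastq.gz"],
            "R2": ["SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_2.fastq.gz"]
        }
    }
}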

The config.yaml file contains various parameters one needs to change. The pipeline can handle single-end or paired-end fastqs, ChIPseq with or without Input/IgG controls, and ATACseq data.

## whether it is from fastq files or from bam files
from_fastq: True
## the reads are paired end or single end
paired_end: True
## set True for reads >70bp (aligned with bwa mem), False for reads <70bp (bwa aln)
long_reads: True
# what's the control for IP
# if it is ChIP-seq data
# this should match the factor column (third column) of meta.txt
control: 'Input'
## if it is ChIP-seq data without Input/IgG control or ATACseq data (without control by nature)
## set control : False
## the reference fasta path
ref_fa: "/n/regal/informatics_public/reference_genome_by_tommy/ucsc/mm9/mm9.fa"
macs2_g: mm
macs2_pvalue: 1e-5
# other macs2 arguments
macs2_args: -f BAM
Genrich_args: "-r -p 0.05 -a 200 "
## number of reads to downsample to; I set it to 50 million. If the read count is smaller
## than 50 million, downsampling keeps the original reads.
## if downsample is set to False, no downsampling is done; a symbolic link is used instead
downsample: False
target_reads: 50000000
## set up the config.yaml file
nano config.yaml
## dry run
snakemake -np
## real run
snakemake -j 24
## submit to slurm
./pyflow_Genrich_compare.sh
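
If a run is interrupted (for example the screen session dies mid-run), Snakemake may leave the working directory locked. These are standard Snakemake flags for recovering, not specific to this workflow:

## remove the stale lock, then resume and rerun any incomplete jobs
snakemake --unlock
snakemake -j 24 --rerun-incomplete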

To do

  • [ ] Adaptor trimming

  • [ ] Add benchmark directive for logging time and memory usage.

  • [ ] Genrich with replicates https://github.com/jsh58/Genrich#multiple-replicates

  • [ ] IDR framework (https://github.com/kundajelab/idr): compare it with Genrich. IDR is fast.

  • [ ] conda env and docker container

Code Snippets

shell:
    """
    gunzip -c {input.r1} | gzip > {output[0]} 2> {log}
    gunzip -c {input.r2} | gzip > {output[1]} 2>> {log}
    """
SnakeMake From line 107 of master/Snakefile
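
This rule decompresses and recompresses the input fastqs into the working directory; if {input.r1} expands to several files, gunzip -c concatenates them, so it also merges multi-lane fastqs per sample. As a side note, concatenated gzip members already form a valid gzip stream, so a plain cat a.fastq.gz b.fastq.gz > merged.fastq.gz would work for manual merging too.
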
shell: "gunzip -c {input} | gzip > {output} 2> {log}"
SnakeMake From line 120 of master/Snakefile
run:
    if config["long_reads"]:
        shell(
            r"""
            bwa mem -t 5 -M -v 1 -R '{params.rg}' {config[ref_fa]} {input[0]} {input[1]} 2> {log.bwa} \
            | samblaster 2> {log.markdup} \
            | samtools view -Sb -F 4 - \
            | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}
            samtools index {output[0]}
            """)
    ## short reads < 70bp: use bwa aln
    ## Probably the most important parameter is how many mismatches you allow between a read
    ## and a potential mapping location for that location to be considered a match.
    ## The default is 4% of the read length, but you can set it to another proportion of the
    ## read length or to a fixed integer (see bwa aln -n).
    else:
        shell(
            r"""
            bwa aln -t 5 {config[ref_fa]} {input[0]} 2> {log.bwa} > 03aln/{wildcards.sample}_R1.sai
            bwa aln -t 5 {config[ref_fa]} {input[1]} 2>> {log.bwa} > 03aln/{wildcards.sample}_R2.sai
            bwa sampe -s -r '{params.rg}' {config[ref_fa]} 03aln/{wildcards.sample}_R1.sai 03aln/{wildcards.sample}_R2.sai {input[0]} {input[1]} 2>> {log.bwa} \
            | samblaster 2> {log.markdup} \
            | samtools view -Sb -F 4 - \
            | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}
            rm 03aln/{wildcards.sample}_R1.sai 03aln/{wildcards.sample}_R2.sai
            samtools index {output[0]}
            """)
SnakeMake From line 159 of master/Snakefile
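
The mismatch allowance described in the comments above is bwa aln's -n option: an integer sets a fixed maximum edit distance, while a float sets the fraction of alignments allowed to be missed assuming a 2% base error rate (bwa's default is 0.04). The values below are illustrative, not what this Snakefile uses:

## fixed integer: allow at most 2 differences
bwa aln -n 2 ref.fa reads.fastq > reads.sai
## float: stricter than the 0.04 default
bwa aln -n 0.02 ref.fa reads.fastq > reads.sai
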
run:
    if config["long_reads"]:
        shell(
            r"""
            bwa mem -t 5 -M -v 1 -R '{params.rg}' {config[ref_fa]} {input} 2> {log.bwa} \
            | samblaster -r 2> {log.markdup} \
            | samtools view -Sb -F 4 - \
            | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}
            samtools index {output[0]}
            """)
    ## short reads < 70bp
    else:
        shell(
            r"""
            bwa aln -t 5 {config[ref_fa]} {input} 2> {log.bwa} > 03aln/{wildcards.sample}.sai
            bwa samse -r '{params.rg}' {config[ref_fa]} 03aln/{wildcards.sample}.sai {input} 2>> {log.bwa} \
            | samblaster -r 2> {log.markdup} \
            | samtools view -Sb -F 4 - \
            | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}
            rm 03aln/{wildcards.sample}.sai
            samtools index {output[0]}
            """)
SnakeMake From line 199 of master/Snakefile
shell:
    """
    samtools flagstat {input} > {output} 2> {log}
    """
SnakeMake From line 229 of master/Snakefile
run:
    import re
    import subprocess
    with open(input[2], "r") as f:
        # fifth line contains the number of mapped reads
        line = f.readlines()[4]
        match_number = re.match(r'(\d.+) \+.+', line)
        total_reads = int(match_number.group(1))
    if config["downsample"]:
        target_reads = config["target_reads"] # 50 million by default, set in the config.yaml file
        if total_reads > target_reads:
            down_rate = target_reads/total_reads
        else:
            down_rate = 1
        shell("sambamba view -f bam -t 5 --subsampling-seed=3 -s {rate} {inbam} | samtools sort -m 2G -@ 5 -T {outbam}.tmp > {outbam} 2> {log}".format(rate = down_rate, inbam = input[0], outbam = output[0], log = log))
        shell("samtools index {outbam}".format(outbam = output[0]))
    else:
        shell("ln -s {inbam} {outbam}".format(inbam = params.source_dir + "/" + input[0], outbam = output[0]))
        shell("ln -s {inbai} {outbai}".format(inbai = params.source_dir + "/" + input[1], outbai = output[1]))
SnakeMake From line 242 of master/Snakefile
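
The downsampling rate computed above is just the ratio of target to mapped reads, clamped at 1. A minimal worked example in Python, assuming 80 million mapped reads and the 50 million target from config.yaml:

total_reads  = 80_000_000   # parsed from samtools flagstat
target_reads = 50_000_000   # config["target_reads"]
down_rate = target_reads / total_reads if total_reads > target_reads else 1
# down_rate == 0.625, so sambamba -s 0.625 keeps roughly 62.5% of the reads
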
shell:
    """
    bamCoverage -b {input[0]} --normalizeUsing RPKM --binSize 30 --smoothLength 300 -p 5 --extendReads 200 -o {output} 2> {log}
    """
SnakeMake From line 269 of master/Snakefile
shell:
    """
    ## for macs2, when --nomodel is set, --extsize defaults to 200bp; this is the same as 2 * the shift-size in macs14.
    source activate macs2
    macs2 callpeak -t {input.case} \
        -c {input.control} -g {config[macs2_g]} \
        --outdir 09peak_macs2 -n {params.name} -p {config[macs2_pvalue]} {params.custom} &> {log}
    """
SnakeMake From line 285 of master/Snakefile
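
The config above passes macs2_args: -f BAM. For paired-end data MACS2 also accepts -f BAMPE, which takes fragment sizes from the read pairs themselves instead of modeling/extending; whether to prefer it here is a judgment call:

## in config.yaml, for paired-end libraries
macs2_args: -f BAMPE
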
shell:
    """
    ## for macs2, when --nomodel is set, --extsize defaults to 200bp; this is the same as 2 * the shift-size in macs14.
    source activate macs2
    macs2 callpeak -t {input.case} \
        -g {config[macs2_g]} \
        --outdir 09peak_macs2 -n {params.name} -p {config[macs2_pvalue]} {params.custom} &> {log}
    """
SnakeMake From line 303 of master/Snakefile
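
For ATACseq data, a common alternative to MACS2's default model is to disable it and center reads on the Tn5 cut sites. The flags below are standard MACS2 options, but the exact values are a common community choice rather than anything taken from this repo:

## in config.yaml, a typical ATACseq setting
macs2_args: -f BAM --nomodel --shift -100 --extsize 200
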
shell:
    """
    samtools sort -n -m 2G -@ {threads} -T {wildcards.sample} \
    -o {output} \
    {input[0]} 2> {log}
    """
SnakeMake From line 322 of master/Snakefile
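
Genrich requires its input BAMs to be sorted by queryname, which is why the rule above runs samtools sort -n first. Genrich can also call peaks jointly on replicates by giving -t a comma-separated list (see the to-do item above); a minimal sketch with hypothetical file names:

./Genrich -t rep1.name_sorted.bam,rep2.name_sorted.bam -o joint_peaks.narrowPeak
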
shell:
    """
    {config[Genrich_path]} -t {input.case} \
    -c {input.control} {params.custom} -o {output} &> {log}
    """
SnakeMake From line 339 of master/Snakefile
shell:
    """
    {config[Genrich_path]} -t {input.case} \
    {params.custom} -o {output} &> {log}
    """
SnakeMake From line 354 of master/Snakefile