snakemake workflow comparing Genrich and MACS2

[workflow diagram of the pipeline]

Dependencies

git clone the repo (on July 22nd) and build:

git clone https://github.com/jsh58/Genrich
cd Genrich
make
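
make produces a Genrich binary inside the cloned directory. The Snakefile calls it through the Genrich_path key in config.yaml, so point that key at your build (the path below is an example; adjust it to your system):

## in config.yaml
Genrich_path: "/path/to/Genrich/Genrich"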

macs2 2.1.2

conda install -c bioconda macs2=2.1.2
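
macs2 is distributed through the bioconda channel, and the peak-calling rules below run source activate macs2, so one convenient setup (a suggestion, not prescribed by the repo) is a dedicated conda env of that name:

conda create -n macs2 -c bioconda macs2=2.1.2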

How to run it

ssh bio1
## start a screen session
screen
git clone https://github.com/crazyhottommy/Genrich_compare
## make samples.json file
python sample2json.py /path/to/fastq/dir meta.txt

The meta.txt file should be a four-column, tab-delimited file.

e.g. for ATACseq data:

cat meta.txt
sample fastq factor read
SRR891269 SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891269 SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891270 SRR891270_GSM1155959_GM12878_ATACseq_50k_Rep3_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891270 SRR891270_GSM1155959_GM12878_ATACseq_50k_Rep3_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891271 SRR891271_GSM1155960_GM12878_ATACseq_50k_Rep4_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891271 SRR891271_GSM1155960_GM12878_ATACseq_50k_Rep4_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891272 SRR891272_GSM1155961_GM12878_ATACseq_500_Rep1_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891272 SRR891272_GSM1155961_GM12878_ATACseq_500_Rep1_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891273 SRR891273_GSM1155962_GM12878_ATACseq_500_Rep2_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891273 SRR891273_GSM1155962_GM12878_ATACseq_500_Rep2_Homo_sapiens_OTHER_2.fastq.gz atac R2
SRR891274 SRR891274_GSM1155963_GM12878_ATACseq_500_Rep3_Homo_sapiens_OTHER_1.fastq.gz atac R1
SRR891274 SRR891274_GSM1155963_GM12878_ATACseq_500_Rep3_Homo_sapiens_OTHER_2.fastq.gz atac R2

for ChIPseq data with single-end reads:

sample fastq_name factor reads
C57BL SRR534669.fastq.gz Per2 R1
C57BL SRR534670.fastq.gz Per2 R1
C57BL SRR534671.fastq.gz Per2 R1
C57BL SRR534672.fastq.gz Per2 R1
C57BL SRR534673.fastq.gz Per2 R1
C57BL SRR534674.fastq.gz Per2 R1
C57BL SRR534676.fastq.gz Input R1

for ChIPseq data with paired-end reads:

sample fastq_name factor reads
LKR10 Input_LKR10_1.fq.gz Input R1
LKR10 Input_LKR10_2.fq.gz Input R2
V6_5 Input_V6_5_1.fq.gz Input R1
V6_5 Input_V6_5_2.fq.gz Input R2
LKR10 Mll4_LKR10_1.fq.gz Mll4 R1
LKR10 Mll4_LKR10_2.fq.gz Mll4 R2
LKR10 Per2_A_LKR10_1.fq.gz Per2A R1
LKR10 Per2_A_LKR10_2.fq.gz Per2A R2
LKR10 Per2_B_LKR10_1.fq.gz Per2B R1
LKR10 Per2_B_LKR10_2.fq.gz Per2B R2
V6_5 Per2_B_V6_5_1.fq.gz Per2 R1
V6_5 Per2_B_V6_5_2.fq.gz Per2 R2
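
sample2json.py writes the samples.json that the Snakefile reads. The exact schema is defined by the script itself; assuming the ATACseq meta.txt above, an entry might look roughly like this sketch (illustrative only, not the guaranteed format):

{
    "SRR891269": {
        "atac": {
            "R1": ["SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_1.fastq.gz"],
            "R2": ["SRR891269_GSM1155958_GM12878_ATACseq_50k_Rep2_Homo_sapiens_OTHER_2.fastq.gz"]
        }
    }
}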

The config.yaml file contains various parameters one needs to change. The pipeline can handle single-end or paired-end fastqs, ChIPseq with or without Input/IgG controls, and ATACseq data.

## whether it is from fastq files or from bam files
from_fastq: True
## the reads are paired end or single end
paired_end: True
## set True for reads >70bp (aligned with bwa mem), False for reads <70bp (bwa aln)
long_reads: True
# what's the control for IP
# if it is ChIP-seq data
# this should match the factor column (third column) of meta.txt
control: 'Input'
## if it is ChIP-seq data without Input/IgG control or ATACseq data (without control by nature)
## set control : False
## the reference fasta path
ref_fa: "/n/regal/informatics_public/reference_genome_by_tommy/ucsc/mm9/mm9.fa"
macs2_g: mm
macs2_pvalue: 1e-5
# other macs2 arguments
macs2_args: -f BAM
Genrich_args: "-r -p 0.05 -a 200 "
## number of reads to downsample to; I set it to 50 million. If the read count is smaller
## than 50 million, downsampling keeps the original reads.
## if downsample is set to False, no downsampling is done; a symbolic link is used instead
downsample: False
target_reads: 50000000
## set up the config.yaml file
nano config.yaml
## dry run
snakemake -np
## real run
snakemake -j 24
## submit to slurm
./pyflow_Genrich_compare.sh
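
If a run is interrupted (for example the screen session dies mid-run), Snakemake may leave the working directory locked. These are standard Snakemake flags for recovering, not specific to this workflow:

## remove the stale lock, then resume and rerun any incomplete jobs
snakemake --unlock
snakemake -j 24 --rerun-incomplete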

To do

  • [ ] Adaptor trimming

  • [ ] Add benchmark directive for logging time and memory usage.

  • [ ] Genrich with replicates https://github.com/jsh58/Genrich#multiple-replicates

  • [ ] IDR framework (https://github.com/kundajelab/idr): compare it with Genrich. IDR is fast.

  • [ ] conda env and docker container

Code Snippets

shell:
    """
    gunzip -c {input.r1} | gzip > {output[0]} 2> {log}
    gunzip -c {input.r2} | gzip > {output[1]} 2>> {log}
    """
SnakeMake From line 107 of master/Snakefile
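
This rule decompresses and recompresses the input fastqs into the working directory; if {input.r1} expands to several files, gunzip -c concatenates them, so it also merges multi-lane fastqs per sample. As a side note, concatenated gzip members already form a valid gzip stream, so a plain cat a.fastq.gz b.fastq.gz > merged.fastq.gz would work for manual merging too.
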
shell: "gunzip -c {input} | gzip > {output} 2> {log}"
SnakeMake From line 120 of master/Snakefile
run:
    if config["long_reads"]:
        shell(
            r"""
            bwa mem -t 5 -M -v 1 -R '{params.rg}' {config[ref_fa]} {input[0]} {input[1]} 2> {log.bwa} \
            | samblaster 2> {log.markdup} \
            | samtools view -Sb -F 4 - \
            | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}
            samtools index {output[0]}
            """)
    ## short reads < 70bp: use bwa aln
    ## Probably the most important parameter is how many mismatches you allow between a read
    ## and a potential mapping location for that location to be considered a match.
    ## The default is 4% of the read length, but you can set it to another proportion of the
    ## read length or to a fixed integer (see bwa aln -n).
    else:
        shell(
            r"""
            bwa aln -t 5 {config[ref_fa]} {input[0]} 2> {log.bwa} > 03aln/{wildcards.sample}_R1.sai
            bwa aln -t 5 {config[ref_fa]} {input[1]} 2>> {log.bwa} > 03aln/{wildcards.sample}_R2.sai
            bwa sampe -s -r '{params.rg}' {config[ref_fa]} 03aln/{wildcards.sample}_R1.sai 03aln/{wildcards.sample}_R2.sai {input[0]} {input[1]} 2>> {log.bwa} \
            | samblaster 2> {log.markdup} \
            | samtools view -Sb -F 4 - \
            | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}
            rm 03aln/{wildcards.sample}_R1.sai 03aln/{wildcards.sample}_R2.sai
            samtools index {output[0]}
            """)
SnakeMake From line 159 of master/Snakefile
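
The mismatch allowance described in the comments above is bwa aln's -n option: an integer sets a fixed maximum edit distance, while a float sets the fraction of alignments allowed to be missed assuming a 2% base error rate (bwa's default is 0.04). The values below are illustrative, not what this Snakefile uses:

## fixed integer: allow at most 2 differences
bwa aln -n 2 ref.fa reads.fastq > reads.sai
## float: stricter than the 0.04 default
bwa aln -n 0.02 ref.fa reads.fastq > reads.sai
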
run:
    if config["long_reads"]:
        shell(
            r"""
            bwa mem -t 5 -M -v 1 -R '{params.rg}' {config[ref_fa]} {input} 2> {log.bwa} \
            | samblaster -r 2> {log.markdup} \
            | samtools view -Sb -F 4 - \
            | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}
            samtools index {output[0]}
            """)
    ## short reads < 70bp
    else:
        shell(
            r"""
            bwa aln -t 5 {config[ref_fa]} {input} 2> {log.bwa} > 03aln/{wildcards.sample}.sai
            bwa samse -r '{params.rg}' {config[ref_fa]} 03aln/{wildcards.sample}.sai {input} 2>> {log.bwa} \
            | samblaster -r 2> {log.markdup} \
            | samtools view -Sb -F 4 - \
            | samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}
            rm 03aln/{wildcards.sample}.sai
            samtools index {output[0]}
            """)
SnakeMake From line 199 of master/Snakefile
shell:
    """
    samtools flagstat {input} > {output} 2> {log}
    """
SnakeMake From line 229 of master/Snakefile
run:
    import re
    import subprocess
    with open(input[2], "r") as f:
        # fifth line contains the number of mapped reads
        line = f.readlines()[4]
        match_number = re.match(r'(\d.+) \+.+', line)
        total_reads = int(match_number.group(1))
    if config["downsample"]:
        target_reads = config["target_reads"] # 50 million by default, set in the config.yaml file
        if total_reads > target_reads:
            down_rate = target_reads/total_reads
        else:
            down_rate = 1
        shell("sambamba view -f bam -t 5 --subsampling-seed=3 -s {rate} {inbam} | samtools sort -m 2G -@ 5 -T {outbam}.tmp > {outbam} 2> {log}".format(rate = down_rate, inbam = input[0], outbam = output[0], log = log))
        shell("samtools index {outbam}".format(outbam = output[0]))
    else:
        shell("ln -s {inbam} {outbam}".format(inbam = params.source_dir + "/" + input[0], outbam = output[0]))
        shell("ln -s {inbai} {outbai}".format(inbai = params.source_dir + "/" + input[1], outbai = output[1]))
SnakeMake From line 242 of master/Snakefile
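
The downsampling rate computed above is just the ratio of target to mapped reads, clamped at 1. A minimal worked example in Python, assuming 80 million mapped reads and the 50 million target from config.yaml:

total_reads  = 80_000_000   # parsed from samtools flagstat
target_reads = 50_000_000   # config["target_reads"]
down_rate = target_reads / total_reads if total_reads > target_reads else 1
# down_rate == 0.625, so sambamba -s 0.625 keeps roughly 62.5% of the reads
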
shell:
    """
    bamCoverage -b {input[0]} --normalizeUsing RPKM --binSize 30 --smoothLength 300 -p 5 --extendReads 200 -o {output} 2> {log}
    """
SnakeMake From line 269 of master/Snakefile
shell:
    """
    ## for macs2, when --nomodel is set, --extsize defaults to 200bp; this is the same as 2 * the shift-size in macs14.
    source activate macs2
    macs2 callpeak -t {input.case} \
        -c {input.control} -g {config[macs2_g]} \
        --outdir 09peak_macs2 -n {params.name} -p {config[macs2_pvalue]} {params.custom} &> {log}
    """
SnakeMake From line 285 of master/Snakefile
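
The config above passes macs2_args: -f BAM. For paired-end data MACS2 also accepts -f BAMPE, which takes fragment sizes from the read pairs themselves instead of modeling/extending; whether to prefer it here is a judgment call:

## in config.yaml, for paired-end libraries
macs2_args: -f BAMPE
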
shell:
    """
    ## for macs2, when --nomodel is set, --extsize defaults to 200bp; this is the same as 2 * the shift-size in macs14.
    source activate macs2
    macs2 callpeak -t {input.case} \
        -g {config[macs2_g]} \
        --outdir 09peak_macs2 -n {params.name} -p {config[macs2_pvalue]} {params.custom} &> {log}
    """
SnakeMake From line 303 of master/Snakefile
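
For ATACseq data, a common alternative to MACS2's default model is to disable it and center reads on the Tn5 cut sites. The flags below are standard MACS2 options, but the exact values are a common community choice rather than anything taken from this repo:

## in config.yaml, a typical ATACseq setting
macs2_args: -f BAM --nomodel --shift -100 --extsize 200
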
shell:
    """
    samtools sort -n -m 2G -@ {threads} -T {wildcards.sample} \
    -o {output} \
    {input[0]} 2> {log}
    """
SnakeMake From line 322 of master/Snakefile
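
Genrich requires its input BAMs to be sorted by queryname, which is why the rule above runs samtools sort -n first. Genrich can also call peaks jointly on replicates by giving -t a comma-separated list (see the to-do item above); a minimal sketch with hypothetical file names:

./Genrich -t rep1.name_sorted.bam,rep2.name_sorted.bam -o joint_peaks.narrowPeak
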
shell:
    """
    {config[Genrich_path]} -t {input.case} \
    -c {input.control} {params.custom} -o {output} &> {log}
    """
SnakeMake From line 339 of master/Snakefile
shell:
    """
    {config[Genrich_path]} -t {input.case} \
    {params.custom} -o {output} &> {log}
    """
SnakeMake From line 354 of master/Snakefile