ATAC-seq snakemake pipeline



Understand the method

source: http://www.the-scientist.com/?articles.view/articleNo/44772/title/Reveling-in-the-Revealed/

https://www.biostars.org/p/209592/

References

Anshul Kundaje lab protocol

ataqv for QC: https://github.com/ParkerLab/ataqv

Peak calling

Peaks of read enrichment were called using macs2. For ATAC-seq libraries, reads mapping to chrM were excluded prior to peak calling:

samtools view -h -F1024 $bam | grep -v -P '\tchrM\t' | samtools view -b - > $tmpBam
macs2 callpeak --keep-dup all -t $tmpBam -n ${bname} 
  • -f 3: only include alignments with SAM flag 3 set, i.e. the read is paired and mapped in a proper pair

  • -F 4: exclude alignments with flag 4: the read itself did not map

  • -F 8: exclude alignments with flag 8: the mate did not map

  • -F 256: exclude alignments with flag 256, which means that bwa mapped the read to multiple places in the reference genome and this alignment is not the best

  • -F 1024: exclude alignments marked with SAM flag 1024, which indicates that the read is an optical or PCR duplicate (this flag would be set by Picard)

  • -F 2048: exclude alignments marked with SAM flag 2048, indicating chimeric alignments, where bwa decided that parts of the read mapped to different regions in the
    genome. These records are the individual aligned segments of the read and usually indicate structural variation. We're not going to base peak calls on them; see the combined filtering sketch below.
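A minimal sketch combining the flags above into one filtering step (3340 = 4 + 8 + 256 + 1024 + 2048; file names are placeholders):

samtools view -b -f 3 -F 3340 input.bam > filtered.bam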

Motif footprinting

https://sites.google.com/site/atacseqpublic/atac-seq-analysis-methods/offsetmethods

For peak calling, the paper mentions that "all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the - strand were offset by -5 bp".

The offsets are only really important when doing TF footprinting with CENTIPEDE. The simplest approach for me was to convert the bam to bed (bedtools bamtobed), apply the offset with awk (or whichever method you prefer) and, if it's useful, convert back to a bam (bedtools bedtobam; see the sketch after the shift step below).

Convert the bam to bed:

bedtools bamtobed -i bowtie_dup_rm.bam > my.bed

Shift forward-strand reads by +4 bp and reverse-strand reads by -5 bp:

awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 4, $3 + 4, $4, $5, $6; else print $1, $2 - 5, $3 - 5, $4, $5, $6}' my.bed > my_shifted.bed
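If a bam is needed again downstream, the reverse conversion is a one-liner; the chrom-sizes file name here is an assumed placeholder:

# convert the shifted bed back to bam (needs a tab-separated chrom-sizes "genome" file)
bedtools bedtobam -i my_shifted.bed -g hg38.chrom.sizes > my_shifted.bam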

Workflow of the pipeline

Please cite

Dependencies

How to distribute workflows

read doc

ssh shark.mdanderson.org
# start a screen session
screen
# make a folder; name it yourself (I named it workdir)
mkdir /rsch2/genomic_med/krai/workdir/
cd /rsch2/genomic_med/krai/workdir/
git clone https://gitlab.com/tangming2005/snakemake_ATACseq_pipeline
cd snakemake_ATACseq_pipeline
## edit the config.yaml file as needed, e.g. set mouse or human for the ref genome, or the p-value cutoff for peak calling
nano config.yaml
## skip this if on Shark; Samir has py351 set up for you, see STEPS below
conda create -n snakemake python=3 snakemake
source activate snakemake

STEPS

on Shark

There is a python3 environment already set up; just do:

source activate py351

Create the samples.json file by pointing the script at a fastq folder; this folder should contain all of the sample subfolders.

Please use the full path to the folder that contains your fastq subfolders.

python3 sample2json.py --fastq_dir /path/to/the/fastq/

e.g.

KR_PM374/
├── Sample_ATAC-COAD-hCRC-R1-ATAC--NC-ELO-tEDD1aE31Dg
│ ├── ATAC-COAD-hCRC-R1-ATAC--NC-ELO-tEDD1aE31Dg_S4_L001_R1_001.fastq.gz
│ ├── ATAC-COAD-hCRC-R1-ATAC--NC-ELO-tEDD1aE31Dg_S4_L001_R2_001.fastq.gz
│ ├── ATAC-COAD-hCRC-R1-ATAC--NC-ELO-tEDD1aE31Dg_S4_L002_R1_001.fastq.gz
│ ├── ATAC-COAD-hCRC-R1-ATAC--NC-ELO-tEDD1aE31Dg_S4_L002_R2_001.fastq.gz
│ ├── ATAC-COAD-hCRC-R1-ATAC--NC-ELO-tEDD1aE31Dg_S4_L003_R1_001.fastq.gz
│ ├── ATAC-COAD-hCRC-R1-ATAC--NC-ELO-tEDD1aE31Dg_S4_L003_R2_001.fastq.gz
│ ├── ATAC-COAD-hCRC-R1-ATAC--NC-ELO-tEDD1aE31Dg_S4_L004_R1_001.fastq.gz
│ └── ATAC-COAD-hCRC-R1-ATAC--NC-ELO-tEDD1aE31Dg_S4_L004_R2_001.fastq.gz
├── Sample_ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g
│ ├── ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g_S1_L001_R1_001.fastq.gz
│ ├── ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g_S1_L001_R2_001.fastq.gz
│ ├── ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g_S1_L002_R1_001.fastq.gz
│ ├── ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g_S1_L002_R2_001.fastq.gz
│ ├── ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g_S1_L003_R1_001.fastq.gz
│ ├── ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g_S1_L003_R2_001.fastq.gz
│ ├── ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g_S1_L004_R1_001.fastq.gz
│ └── ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g_S1_L004_R2_001.fastq.gz
├── Sample_ATAC-COAD-m320-R1-ATAC--NC-ELO-t7AECa1D99g
│ ├── ATAC-COAD-m320-R1-ATAC--NC-ELO-t7AECa1D99g_S2_L001_R1_001.fastq.gz
│ ├── ATAC-COAD-m320-R1-ATAC--NC-ELO-t7AECa1D99g_S2_L001_R2_001.fastq.gz
│ ├── ATAC-COAD-m320-R1-ATAC--NC-ELO-t7AECa1D99g_S2_L002_R1_001.fastq.gz
│ ├── ATAC-COAD-m320-R1-ATAC--NC-ELO-t7AECa1D99g_S2_L002_R2_001.fastq.gz
│ ├── ATAC-COAD-m320-R1-ATAC--NC-ELO-t7AECa1D99g_S2_L003_R1_001.fastq.gz
│ ├── ATAC-COAD-m320-R1-ATAC--NC-ELO-t7AECa1D99g_S2_L003_R2_001.fastq.gz
│ ├── ATAC-COAD-m320-R1-ATAC--NC-ELO-t7AECa1D99g_S2_L004_R1_001.fastq.gz
│ └── ATAC-COAD-m320-R1-ATAC--NC-ELO-t7AECa1D99g_S2_L004_R2_001.fastq.gz
└── Sample_ATAC-COAD-mNOD-R1-ATAC--NC-ELO-t076AaFFEDg
 ├── ATAC-COAD-mNOD-R1-ATAC--NC-ELO-t076AaFFEDg_S3_L001_R1_001.fastq.gz
 ├── ATAC-COAD-mNOD-R1-ATAC--NC-ELO-t076AaFFEDg_S3_L001_R2_001.fastq.gz
 ├── ATAC-COAD-mNOD-R1-ATAC--NC-ELO-t076AaFFEDg_S3_L002_R1_001.fastq.gz
 ├── ATAC-COAD-mNOD-R1-ATAC--NC-ELO-t076AaFFEDg_S3_L002_R2_001.fastq.gz
 ├── ATAC-COAD-mNOD-R1-ATAC--NC-ELO-t076AaFFEDg_S3_L003_R1_001.fastq.gz
 ├── ATAC-COAD-mNOD-R1-ATAC--NC-ELO-t076AaFFEDg_S3_L003_R2_001.fastq.gz
 ├── ATAC-COAD-mNOD-R1-ATAC--NC-ELO-t076AaFFEDg_S3_L004_R1_001.fastq.gz
 └── ATAC-COAD-mNOD-R1-ATAC--NC-ELO-t076AaFFEDg_S3_L004_R2_001.fastq.gz
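
With the layout above, the call would look like this (the path is illustrative):

python3 sample2json.py --fastq_dir /rsch2/genomic_med/krai/KR_PM374/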

Naming convention of the files

A regular expression is used to parse the file names.

Inside sample2json.py:

m = re.search(r"-([A-Z]{4})-([A-Z0-9a-z]{4})-[0-9A-Z]{2}-.+_(L[0-9]{3})_(R[12])_[0-9]{3}.fastq.gz", file)
project = m.group(1)
sample = m.group(2)
lane = m.group(3)
# R1 or R2 for forward and reverse read
reads = m.group(4)
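
A quick sanity check of the regex against one of the example fastq names above (run ad hoc from the shell; the filename is taken from the listing):

python3 - <<'EOF'
import re
f = "ATAC-COAD-m280-R1-ATAC--NC-ELO-t6313a3D02g_S1_L001_R1_001.fastq.gz"
m = re.search(r"-([A-Z]{4})-([A-Z0-9a-z]{4})-[0-9A-Z]{2}-.+_(L[0-9]{3})_(R[12])_[0-9]{3}.fastq.gz", f)
print(m.groups())  # prints ('COAD', 'm280', 'L001', 'R1')
EOF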

check the file information in the json file:

less -S samples.json

dry run to test

## dry run
snakemake -np

If there are no errors, proceed as below.

Using DRMAA

Job control through DRMAA

DRMAA is only supported on Shark.

module load drmaa
./pyflow-drmaa-ATACseq.sh

When using DRMAA, you can press Control-C to stop the current run.

Dependent jobs are submitted one by one; if a job fails, the pipeline stops. Good for initial testing.

submit all jobs to the cluster

./pyflow-ATACseq.sh 

All jobs will be submitted to the cluster queue at once. This is useful when you expect most jobs to succeed and want them queued to gain priority.

job control

To kill all of your pending jobs you can use the command:

bkill `bjobs -u krai | grep PEND | cut -f1 -d" "`
Other useful bjobs invocations:

bjobs -pl
    Display detailed information about all pending jobs of the invoker.
bjobs -ps
    Display only pending and suspended jobs.
bjobs -u all -a
    Display all jobs of all users.
bjobs -d -q short -m apple -u mtang1
    Display all the recently finished jobs submitted by mtang1 to the queue short and executed on the host apple.

rerun some of the jobs

# specify the name of a rule; all files associated with that rule will be rerun, e.g. rerun the macs2 peak-calling rule
./pyflow-ATACseq.sh -R call_peaks_macs2
## rerun one sample: just specify the name of the target file
./pyflow-ATACseq.sh -R 04aln/m280.sorted.bam
## rerun only the align rule
./pyflow-ATACseq.sh -f align
## check snakemake -h
## -R, -f, -F and --until are useful

Checking results after the run finishes

snakemake --summary | sort -k1,1 | less -S
# the detailed summary also gives the commands used to generate each output and the inputs used
snakemake --detailed-summary | sort -k1,1 > snakemake_run_summary.txt

clean the folders

I use echo first to see what will be removed; once the list looks right, you can remove for real (see below).

find . -maxdepth 1 -type d -name "[0-9]*" | xargs echo rm -rf
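
Once the echo output looks right, drop the echo to actually delete:

find . -maxdepth 1 -type d -name "[0-9]*" | xargs rm -rf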

Snakemake does not trigger re-runs if I add additional input files. What can I do?

Snakemake has a kind of "lazy" policy about added input files if their modification date is older than that of the output files. One reason is that the information about what to do cannot be inferred just from the input and output files; additional information about the last run would need to be stored. Since behaviour would be inconsistent between cases where that information is available and where it is not, this functionality has been encoded as an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument --list-input-changes in the following way:

snakemake -n -R `snakemake --list-input-changes`

How do I trigger re-runs for rules with updated code or parameters?

snakemake -n -R `snakemake --list-params-changes`

and

snakemake -n -R `snakemake --list-code-changes`

Peak calling

By Tao Liu:

If you followed original protocol for ATAC-Seq, you should get Paired-End reads. If so, I would suggest you just use "--format BAMPE" to let MACS2 pileup the whole fragments in general. But if you want to focus on looking for where the 'cutting sites' are, then '--nomodel --shift -100 --extsize 200' should work.
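
A hedged sketch of the two invocations Tao Liu describes (the input name, -g hs, and output prefixes are placeholders; pick the genome flag that matches your data):

# pile up whole fragments from paired-end data
macs2 callpeak -t sample.bam -f BAMPE --keep-dup all -g hs -n sample_fragments
# focus on the Tn5 cutting sites instead
macs2 callpeak -t sample.bam --nomodel --shift -100 --extsize 200 --keep-dup all -g hs -n sample_cutsites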

Code Snippets

From line 58 of master/Snakefile:

shell:
    """
    gunzip -c {input.r1} | gzip > {output[0]} 2> {log}
    gunzip -c {input.r2} | gzip > {output[1]} 2>> {log}
    """
From line 72 of master/Snakefile:

shell:
    """
    # fastqc works fine on .gz files as well
    module load fastqc
    fastqc -o 02fqc -f fastq --noextract {input[0]} {input[1]} 2> {log}
    """
From line 86 of master/Snakefile:

shell:
    """
    # python2.x
    trim_adapters {input[0]} {input[1]} 2> {log}
    mv 01seq/{wildcards.sample}_R1.trimmed.fastq.gz 03trim/
    mv 01seq/{wildcards.sample}_R2.trimmed.fastq.gz 03trim/
    """
From line 105 of master/Snakefile:

shell:
    """
    ## samblaster marks duplicates on read-id-grouped reads; the bam is not coordinate sorted here
    bowtie2 --threads 5 -X2000 -x {config[idx_bt2]} -1 {input[0]} -2 {input[1]} 2> {log.bowtie2} \
    | samblaster 2> {log.markdup} \
    | samtools view -Sb - > {output[0]}
    """
From line 122 of master/Snakefile:

shell:
    """
    samtools flagstat {input} > {output} 2> {log}
    """
From line 133 of master/Snakefile:

shell:
    """
    ataqv {config[ataqv_g]} {input} --metrics-file {output} 2> {log}
    """
From line 144 of master/Snakefile:

shell:
    """
    mkarv 11ATAC_qc_html {input}
    """
From line 159 of master/Snakefile:

shell:
    """
    # remove duplicates and reads on chrM, then coordinate sort the bam
    # samblaster expects a name-sorted bam
    samtools view -h {input} | samblaster --removeDups \
    | grep -v -P '\tchrM\t' \
    | samtools view -Sb -F 4 - \
    | samtools sort -m 2G -@ 5 -T {input}.tmp -o {output[0]}

    samtools index {output[0]}
    """
From line 178 of master/Snakefile:

shell:
    """
    Rscript /scratch/genomic_med/apps/phantompeak/phantompeakqualtools/run_spp_nodups.R -c={input} -savp -rf -p=4 -odir=05phantompeakqual -out={output} -tmpdir=05phantompeakqual 2> {log}
    """
From line 192 of master/Snakefile:

shell:
    """
    samtools flagstat {input} > {output} 2> {log}
    """
From line 205 of master/Snakefile:

run:
    import re
    import subprocess
    with open(input[2], "r") as f:
        # fifth line contains the number of mapped reads
        line = f.readlines()[5]
        match_number = re.match(r'(\d.+) \+.+', line)
From line 233 of master/Snakefile:

shell:
    """
    # no window smoothing is done; for paired-end data, bamCoverage extends reads to the fragment length of the pair
    bamCoverage -b {input[0]} --ignoreDuplicates --skipNonCoveredRegions --normalizeUsingRPKM -p 5 --extendReads -o {output} 2> {log}
    """
From line 249 of master/Snakefile:

shell:
    """
    ## for macs2, when --nomodel is set, --extsize defaults to 200 bp, the same as 2 * shift-size in macs14
    macs2 callpeak -t {input[0]} \
        --keep-dup all -f BAMPE -g {config[macs2_g]} \
        --outdir 08peak_macs2 -n {params.name} -p {config[macs2_pvalue]} \
        --broad --broad-cutoff {config[macs2_pvalue_broad]} &> {log}
    """
From line 267 of master/Snakefile:

shell:
    """
    multiqc 02fqc 04aln 00log -o 10multiQC -d -f -v -n multiQC_log 2> {log}
    """
From line 281 of master/Snakefile:

shell:
    """
    cat {input} | bedtools slop -b 200 -g {config[genome_size]} | sort -k1,1 -k2,2n | bedtools merge > {output} 2> {log}
    """
From line 297 of master/Snakefile:

shell:
    """
    cd 09nucleoATAC
    nucleoatac run --bed {params.outputdir}/{input[2]} --bam {params.outputdir}/{input[0]} --cores 5 --fasta {config[genome_fasta]} --out {wildcards.sample} 2> {params.outputdir}/{log}
    """

URL: https://github.com/crazyhottommy/pyflow-ATACseq
Name: pyflow-atacseq
Version: v1.0
Copyright: Public Domain
License: MIT License
