ENCODE pipeline for histone marks developed for the psychENCODE project

public public 10mo ago 0 bookmarks

Summary

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.

The orginal specs for the ENCODE pipeline can be found here : https://goo.gl/KqHjKH

This specific version is using the workflow system snakemake (https://bitbucket.org/snakemake/snakemake/wiki/Home)

Code Snippets

 9
10
shell:
	"bwa mem -M -t {threads} {config[genome]} {input} | gzip -c > {output}"
11
12
shell:
	"bwa mem -M -t {threads} {config[genome]} {input} | gzip -c > {output}"
 7
 8
 9
10
11
12
13
14
15
16
17
	shell:
		"""
		zcat {input} | \
                awk 'BEGIN{{FS="\t"; OFS="\t"}}
                !/^@/ && $6!="*" {{
                cigar=$6; gsub("[0-9]+D","",cigar);
                n=split(cigar, vals, "[A-Z]");
                s=0; for(i=1;i<=n;i++) s=s+vals[i];
		seqlen=length($10);
		if(s!=seqlen) print $1"\t";}}' | sort | uniq > {output}
		"""
27
28
29
30
31
32
33
34
35
run:
	with open("mapped_reads/" + wildcards.sample + ".badcigar") as fp:
		peek = [x for i,x in enumerate(fp) if i<10 and x.rstrip()]
	cigar_len=len(peek)
	shell("""if [ {cigar_len} -gt 0 ]
	then
		zcat {input.sam} | grep -vF -f {input.cigar} | \
		samtools view -@ {threads} -Su - | \
		samtools sort -@ {threads} -T {params.prefix} -o {output} -
53
54
55
56
57
58
59
60
61
62
63
	shell:
		"""
			samtools view -@ {threads} -F1804 -f2 -q 20 -h \
                        -u {input} | \
                        samtools sort -@ {threads} -m {params.mem} -T {params.prefix} -n -o {output.tmp} -

			samtools fixmate -r {output.tmp} {output.clean}

			samtools view -@ {threads} -F1804 -f2 -u {output.clean} |\
                        samtools sort -@ {threads} -m {params.mem} -T {params.prefix}_2 -o {output.flt} -
		"""
74
75
76
77
78
79
80
81
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.temp} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
 93
 94
 95
 96
 97
 98
 99
100
101
102
		shell:
			"""
			samtools view -@ {threads} -h {input} |\
                        grep -vF -f {config[sponges]} - |\
                        samtools view -@ {threads} -b -o {output.flt} -

			samtools sort -@ {threads} -m {params.mem} -T {params.prefix} -n -o {output.sponge_temp} {output.flt}
			samtools fixmate -r {output.sponge_temp} {output.flt}

			"""
116
117
118
119
120
121
122
123
124
125
		shell:
                        """
			samtools view -@ {threads} -F1804 -f2 -b -u {input} |\
                        samtools sort -@ {threads} -m {params.mem} -T {params.prefix}_2 -o {output.final_bam} -

                        #samtools view -@ {threads} -F1804 -f2 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -m {params.mem} -T {params.prefix} -o {output.tmp_sort} -n {input}
                        """	
136
137
138
139
140
141
142
143
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.flt} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
155
156
157
158
159
160
		shell:
                        """
                        samtools view -@ {threads} -F1804 -f2 -b -T {params.prefix} -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

                        """	
173
174
175
176
177
178
	shell:
		"""
		samtools sort -@ {threads} -T {params.prefix} -o {output.tmp_sort} -n {input.final_bam}
		bedtools bamtobed -bedpe -mate1 -i {output.tmp_sort} |\
                gzip -c > {output.bedpe} 
		"""
187
188
189
190
191
192
193
194
	shell:
		"""zcat {input} | \
                awk 'BEGIN{{OFS="\t"; FS="\t"}}
                {{ chrom=$1; beg=$2; end=$6;
                if($2>$5){{beg=$5}} if($3>$6){{end=$3}}
                print chrom,beg,end
                }}' - | {config[sort]} --parallel={threads} -S 2G -k1,1 -k2,2n | \
                gzip -c > {output.bed}"""
 8
 9
10
11
12
	shell:
		"""
            	samtools view -@ {threads} -Su {input.sam} |\
                samtools sort -@ {threads} -T {params.prefix} -o {output} -
		"""
21
22
23
24
25
	shell:
		"""
			samtools view -@ {threads} -F 1804 -q 20 \
                        -u {input} -o {output.flt}
		"""
36
37
38
39
40
41
42
43
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.temp} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
51
52
53
54
55
56
57
		shell:
			"""
			samtools view -@ {threads} -h {input} |\
                        grep -vF -f {config[sponges]} - |\
                        samtools view -@ {threads} -b -o {output.flt} -

			"""
70
71
72
73
74
75
76
		shell:
                        """
                        samtools view -@ {threads} -F1804 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -n -T {params.prefix} -o {output.tmp_sort} {input}
                        """	
87
88
89
90
91
92
93
94
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.flt} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
107
108
109
110
111
112
113
		shell:
                        """
                        samtools view -@ {threads} -F1804 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -n -T {params.prefix} -o {output.tmp_sort} {input}
                        """	
122
123
124
125
126
127
	shell:
		"""
		bedtools bamtobed -i {input} | cut -f1-3 | \
		{config[sort]} --parallel={threads} -k1,1 -k2,2n | \
                gzip -c > {output.bed} 
		"""
9
shell:"samtools flagstat {input.final} > {output.final_mapstats}"
18
shell:"samtools flagstat {input.raw} > {output.raw_mapstats}"
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
	shell:
		"""
		samtools sort -n -@ {threads} -T {params.prefix} -o {output.tmp_sort_bam} {input.flt}
		echo 1 | \
                     awk '{{print "#Total\tDistinct\tOne\tTwo\tNRF\tPBC1\tPBC2"}}' \
                     > {output.lib_complexity}

		bedtools bamtobed -bedpe -i {output.tmp_sort_bam} | \
                     awk 'BEGIN{{OFS="\t"}}{{print $1,$2,$4,$6,$9,$10}}' | \
                     grep -v 'chrM' | sort | uniq -c | \
                     awk 'BEGIN{{mt=0;m0=0;m1=0;m2=0; OFS="\t"}}
                               ($1==1){{m1=m1+1}} ($1==2){{m2=m2+1}} {{m0=m0+1}} {{mt=mt+$1}}
                          END{{print mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}}'\
                     >> {output.lib_complexity}
		"""
9
shell: "samtools flagstat {input.final} > {output.final_mapstats}"
18
shell: "samtools flagstat {input.raw} > {output.raw_mapstats}"
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
	shell:
		"""
		samtools sort -n -@ {threads} -T {params.prefix} -o {output.tmp_sort_bam} {input.flt}
		echo 1 | \
                     awk '{{print "#Total\tDistinct\tOne\tTwo\tNRF\tPBC1\tPBC2"}}' \
                     > {output.lib_complexity}

		bedtools bamtobed -i {output.tmp_sort_bam} | \
                     awk 'BEGIN{{OFS="\t"}}{{print $1,$2,$3,$6}}' | \
                     grep -v 'chrM' | sort | uniq -c | \
                     awk 'BEGIN{{mt=0;m0=0;m1=0;m2=0; OFS="\t"}}
                               ($1==1){{m1=m1+1}} ($1==2){{m2=m2+1}} {{m0=m0+1}} {{mt=mt+$1}}
                          END{{if(m2>0) {{print mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}}else{{print mt,m0,m1,m2,m0/mt,m1/m0,"-INF"}}}}'\
                     >> {output.lib_complexity}
		"""
11
12
13
14
15
16
shell:
	"""
	bamToBed -i {input} | awk 'BEGIN{{OFS="\\t"}}{{$4="N";$5="1000";print $0}}' |\
	tee {output.temp_tagAlign} |\
	gzip -c > {output.tagAlign}
	"""
27
28
29
30
31
32
33
34
35
	shell:
		"""
		grep -v 'chrM' {input.temp_tagAlign} |\
		shuf -n 15000000  |awk 'BEGIN{{OFS="\\t"}}{{$4="N";$5="1000";print $0}}'  > {output.subsample} 

		Rscript {config[path]}/run_spp_nodups.R -c={output.subsample} -p={threads} -filtchr=chrM -savp={output.cc_plot} -out={output.cc_score} > /dev/null 2>&1
        	sed -r  's/,[^\\t]+//g' {output.cc_score} > {output.temp_cc_score} 
		cp {output.temp_cc_score} {output.cc_score}		
		"""
11
12
13
14
15
16
shell:
	"""
	bamToBed -i {input} | awk 'BEGIN{{OFS="\\t"}}{{$4="N";$5="1000";print $0}}' |\
	tee {output.temp_tagAlign} |\
	gzip -c > {output.tagAlign}
	"""
27
28
29
30
31
32
33
34
35
	shell:
		"""
		grep -v 'chrM' {input.temp_tagAlign} |\
		shuf -n 15000000  > {output.subsample} 

		Rscript {config[path]}/run_spp_nodups.R -c={output.subsample} -p={threads} -filtchr=chrM -savp={output.cc_plot} -out={output.cc_score} > /dev/null 2>&1
        	sed -r  's/,[^\\t]+//g' {output.cc_score} > {output.temp_cc_score} 
		cp {output.temp_cc_score} {output.cc_score}		
		"""
13
14
15
16
17
18
19
20
21
22
23
	shell:
		"""
		#If the file is too big '--random-source' option will fail. That's why the '40000000'
		zcat {input} | shuf --random-source=<(zcat {input} | head -40000000) > {output.shuf}
		{config[split]} -d -nl/2 --additional-suffix=\".bed\" {output.shuf} bed_files/psr_{wildcards.sample}.

		{config[sort]} --parallel={threads} -S 2G \
                     -k1,1 -k2,2n {output.temp_psr1} | gzip -c > {output.psr1} 
		{config[sort]} --parallel={threads} -S 2G \
                     -k1,1 -k2,2n {output.temp_psr2} | gzip -c > {output.psr2} 
		"""
32
33
34
35
shell:
	"""
	zcat {input} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output}
	"""
47
48
49
50
51
shell:
	"""
	zcat {input.input1} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output.rep1}
	zcat {input.input2} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output.rep2}
	"""
59
60
61
62
shell:
	"""
	zcat {input} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output}
	"""
32
33
34
35
36
37
38
run:
	import os, sys, time,subprocess
	#Calling narrowPeaks	
	shell("""
	fraglen=`cat {input.cc_scores}| cut -f3`

	{config[macs2]} callpeak \
409
410
411
412
413
414
415
416
417
418
419
420
421
422
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
589
590
591
592
593
594
595
596
597
598
599
600
601
602
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
769
770
771
772
773
774
775
776
777
778
779
780
781
782
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
33
34
35
36
37
38
39
run:
	import os, sys, time,subprocess
	#Calling narrowPeaks	
	shell("""
	fraglen=`cat {input.cc_scores}| cut -f3`

	{config[macs2]} callpeak \
412
413
414
415
416
417
418
419
420
421
422
423
424
425
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
32
33
34
35
36
37
38
run:
	import os, sys, time,subprocess
	#Calling narrowPeaks	
	shell("""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
435
436
437
438
439
440
441
442
443
444
445
446
447
448
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
34
35
36
37
38
39
40
run:
	import os, sys, time,subprocess
	#Calling narrowPeaks	
	shell("""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
413
414
415
416
417
418
419
420
421
422
423
424
425
426
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
	shell:
		"""
		bedtools intersect \
                     -a {input.narrowPeak} -b {input.narrowPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.narrowPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.narrow}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_narrow_bb {output.narrow}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/narrowPeak.as finalPeaks/tmp_{params.prefix}_narrow_bb {config[chrom]} {output.narrowPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_narrow_bb

		bedtools intersect \
                     -a {input.broadPeak} -b {input.broadPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.broadPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.broad}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_broad_bb {output.broad}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/broadPeak.as finalPeaks/tmp_{params.prefix}_broad_bb {config[chrom]} {output.broadPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_broad_bb

		bedtools intersect \
                     -a {input.gappedPeak} -b {input.gappedPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.gappedPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.gapped}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_gapped_bb {output.gapped}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/gappedPeak.as finalPeaks/tmp_{params.prefix}_gapped_bb {config[chrom]} {output.gappedPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_gapped_bb

		sleep 120
		"""
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
	shell:
		"""
		bedtools intersect \
                     -a {input.narrowPeak} -b {input.narrowPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.narrowPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.narrow}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_narrow_bb {output.narrow}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/narrowPeak.as finalPeaks/tmp_{params.prefix}_narrow_bb {config[chrom]} {output.narrowPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_narrow_bb

		bedtools intersect \
                     -a {input.broadPeak} -b {input.broadPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.broadPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.broad}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_broad_bb {output.broad}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/broadPeak.as finalPeaks/tmp_{params.prefix}_broad_bb {config[chrom]} {output.broadPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_broad_bb

		bedtools intersect \
                     -a {input.gappedPeak} -b {input.gappedPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.gappedPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.gapped}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_gapped_bb {output.gapped}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/gappedPeak.as finalPeaks/tmp_{params.prefix}_gapped_bb {config[chrom]} {output.gappedPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_gapped_bb

		sleep 120
		"""
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
	shell:
		"""
		bedtools intersect \
                     -a {input.narrowPeak} -b {input.narrowPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.narrowPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.narrow}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_narrow_bb {output.narrow}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/narrowPeak.as finalPeaks/tmp_{params.prefix}_narrow_bb {config[chrom]} {output.narrowPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_narrow_bb

		bedtools intersect \
                     -a {input.broadPeak} -b {input.broadPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.broadPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.broad}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_broad_bb {output.broad}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/broadPeak.as finalPeaks/tmp_{params.prefix}_broad_bb {config[chrom]} {output.broadPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_broad_bb

		bedtools intersect \
                     -a {input.gappedPeak} -b {input.gappedPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.gappedPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.gapped}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_gapped_bb {output.gapped}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/gappedPeak.as finalPeaks/tmp_{params.prefix}_gapped_bb {config[chrom]} {output.gappedPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_gapped_bb

		sleep 120
		"""
11
12
13
14
15
16
shell:
	"""
	bwa mem -M -t {threads} {config[genome]} {input} | gzip -c > {output.sam}

	samtools view -@ {threads} -Su {output.sam} | samtools sort -@ {threads} -T {params.prefix} -o  {output.bam}
	"""
 9
10
11
12
13
	shell:
		"""
			samtools view -@ {threads} -F 1804 -q 20 \
                        -b {input} -o {output.flt}
		"""
24
25
26
27
28
29
30
31
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.temp} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
39
40
41
42
43
44
45
		shell:
			"""
			samtools view -@ {threads} -h {input} |\
                        grep -vF -f {config[sponges]} - |\
                        samtools view -@ {threads} -b -o {output.flt} -

			"""
58
59
60
61
62
63
64
		shell:
                        """
                        samtools view -@ {threads} -F1804 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -n -T {params.prefix} -o {output.tmp_sort} {input}
                        """	
75
76
77
78
79
80
81
82
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.flt} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
 95
 96
 97
 98
 99
100
101
		shell:
                        """
                        samtools view -@ {threads} -F1804 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -n -T {params.prefix} -o {output.tmp_sort} {input}
                        """	
110
111
112
113
114
115
116
shell:
	"""
	bedtools bamtobed -i {input} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="N";$5="1000";print $0}}' | \
	{config[sort]} --parallel={threads} -k1,1 -k2,2n | \
	gzip -nc > {output.tagAlign} 
	"""
9
shell: "samtools flagstat {input.final} > {output.final_mapstats}"
18
shell: "samtools flagstat {input.raw} > {output.raw_mapstats}"
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
	shell:
		"""
		samtools sort -n -@ {threads} -T {params.prefix} -o {output.tmp_sort_bam} {input.flt}
		echo 1 | \
                     awk '{{print "#TotalReadPairs\tDistinctReadPairs\tOneReadPair\tTwoReadPairs\tNRF=Distinct/Total\tPBC1=OnePair/Distinct\tPBC2=OnePair/TwoPair"}}' \
                     > {output.lib_complexity}

		bedtools bamtobed -i {output.tmp_sort_bam} | \
                     awk 'BEGIN{{OFS="\t"}}{{print $1,$2,$3,$6}}' | \
                     grep -v 'chrM' | sort | uniq -c | \
                     awk 'BEGIN{{mt=0;m0=0;m1=0;m2=0; OFS="\t"}}
                               ($1==1){{m1=m1+1}} ($1==2){{m2=m2+1}} {{m0=m0+1}} {{mt=mt+$1}}
                          END{{if(m2>0) {{print mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}}else{{print mt,m0,m1,m2,m0/mt,m1/m0,"-INF"}}}}'\
                     >> {output.lib_complexity}
		"""
10
11
12
13
14
15
16
17
18
19
20
21
	shell:
		"""
		NREADS=15000000

		zcat {input.tagAlign} | grep -v 'chrM' |\
		shuf -n $NREADS | gzip -nc > {output.subsample} 

		Rscript {config[path]}/run_spp_nodups.R -c={output.subsample} -p={threads} -filtchr=chrM -savp={output.cc_plot} -out={output.cc_score} > /dev/null 2>&1
        	sed -r  's/,[^\\t]+//g' {output.cc_score} > {output.cc_score}.tmp 
		mv {output.cc_score}.tmp {output.cc_score}		
		sleep 120
		"""
13
14
15
16
17
18
19
20
21
22
23
	shell:
		"""
		#If the file is too big '--random-source' option will fail. That's why the '40000000'
		zcat {input} | shuf --random-source=<(zcat {input} | head -40000000) > {output.shuf}
		{config[split]} -d -nl/2 --additional-suffix=\".tagAlign\" {output.shuf} bed_files/psr_{wildcards.sample}.

		{config[sort]} --parallel={threads} -S 2G \
                     -k1,1 -k2,2n {output.temp_psr1} | gzip -c > {output.psr1}
		{config[sort]} --parallel={threads} -S 2G \
                     -k1,1 -k2,2n {output.temp_psr2} | gzip -c > {output.psr2}
		"""
32
33
34
35
shell:
	"""
	zcat {input} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output}
	"""
43
44
45
46
shell:
	"""
	zcat {input} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output}
	"""
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
	-t {input.ip} -c {input.control}\
	-f BED -n peaks/{params.prefix}\
	-g {config[macs_g]} -p 1e-2 --nomodel --shift 0 --extsize $fraglen \
	--keep-dup all -B --SPMR

	sort -k 8gr,8gr {output.narrowPeak} |\
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.narrowPeak_compressed}

	rm -f peaks/{params.prefix}_peaks.xls peaks/{params.prefix}_summits.bed
	touch {output.checkpoint}
	"""
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
	-t {input.ip_psr1} -c {input.control}\
	-f BED -n peaks/{params.prefix}\
	-g {config[macs_g]} -p 1e-2 --nomodel --shift 0 --extsize $fraglen \
	--keep-dup all -B --SPMR

	sort -k 8gr,8gr {output.narrowPeak_psr1} |\
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.narrowPeak_psr1_compressed}

	rm -f peaks/{params.prefix}_peaks.xls peaks/{params.prefix}_summits.bed

	{config[macs2]} callpeak \
	-t {input.ip_psr2} -c {input.control}\
	-f BED -n peaks/{params.prefix_2}\
	-g {config[macs_g]} -p 1e-2 --nomodel --shift 0 --extsize $fraglen \
	 --keep-dup all -B --SPMR

	sort -k 8gr,8gr {output.narrowPeak_psr2} |\
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.narrowPeak_psr2_compressed}

	rm -f peaks/{params.prefix_2}_peaks.xls peaks/{params.prefix_2}_summits.bed
	touch {output.checkpoint}
	"""
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
	-t {input.ip} -c {input.control}\
	-f BED -n peaks/{params.prefix}\
	-g {config[macs_g]} -p 1e-2 --broad --nomodel --shift 0 --extsize $fraglen \
	--keep-dup all

	sort -k 8gr,8gr {output.broadPeak} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.broadPeak_compressed}

	sort -k 14gr,14gr {output.broadPeak} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.gappedPeak_compressed}


	rm -f peaks/{params.prefix}_peaks.xls peaks/{params.prefix}_summits.bed

	"""
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
shell:
	"""
	{config[macs2]} bdgcmp  \
	-t {input.treat} -c {input.control_lambda} \
	--outdir peaks/signals -o {params.prefix}_FE.bdg -m FE

	bedtools slop -i {output.fe_bdg} -g {config[chrom]} -b 0 |\
	awk '{{ if($3!=-1) print $0}}' | \
	{config[path]}/bedClip stdin {config[chrom]} {output.fc_signal_bdg}


	{config[path]}/bedGraphToBigWig {output.fc_signal_bdg} {config[chrom]} {output.fc_signal}


	chipReads=$(zcat {input.ip} | wc -l | awk '{{printf "%f", $1/1000000}}');
	controlReads=$(zcat {input.control} | wc -l | awk '{{printf "%f", $1/1000000}}');

	sval=$(echo "${{chipReads}} ${{controlReads}}" | awk '$1>$2{{printf "%f",$2}} $1<=$2{{printf "%f",$1}}');

	macs2 bdgcmp -t {input.treat} -c {input.control_lambda} --outdir peaks/signals -o {params.prefix}_ppois.bdg -m ppois -S ${{sval}}

	slopBed -i {output.ppois} -g {config[chrom]} -b 0 | \
	awk '{{if ($3 != -1) print $0}}' |  \
	bedClip stdin {config[chrom]} {output.pval_bdg}

	bedGraphToBigWig {output.pval_bdg} {config[chrom]} {output.pval}

	"""
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
	  -t {input.ip_psr1} -c {input.control}\
	  -f BED -n peaks/{params.prefix}\
	  -g {config[macs_g]} -p 1e-2 --broad --nomodel --shift 0 --extsize $fraglen \
	  --keep-dup all

	sort -k 8gr,8gr {output.broadPeak_psr1} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.broadPeak_psr1_compressed}

	sort -k 14gr,14gr {output.broadPeak_psr1} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.gappedPeak_psr1_compressed}


	rm -f peaks/{params.prefix}_peaks.xls peaks/{params.prefix}_summits.bed


	{config[macs2]} callpeak \
	-t {input.ip_psr2} -c {input.control}\
	-f BED -n peaks/{params.prefix_2}\
	-g {config[macs_g]} -p 1e-2 --broad --nomodel --shift 0 --extsize $fraglen \
	--keep-dup all

	sort -k 8gr,8gr {output.broadPeak_psr2} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.broadPeak_psr2_compressed}

	sort -k 14gr,14gr {output.broadPeak_psr2} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.gappedPeak_psr2_compressed}


	rm -f peaks/{params.prefix_2}_peaks.xls peaks/{params.prefix_2}_summits.bed
	"""
253
254
255
256
257
258
259
260
261
262
263
264
265
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	shell("""
	{config[macs2]} callpeak \
295
296
297
298
299
300
301
302
303
304
305
306
307
308
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	shell("""

	{config[macs2]} callpeak \
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
	shell:
		"""
		{config[macs2]} bdgcmp  \
                 -t {input.treat} -c {input.control_lambda} \
                 --outdir peaks/signals -o {params.prefix}_FE.bdg -m FE

		bedtools slop -i {output.fe_bdg} -g {config[chrom]} -b 0 |\
		awk '{{ if($3!=-1) print $0}}' | \
                {config[path]}/bedClip stdin {config[chrom]} {output.fc_signal_bdg}


		{config[path]}/bedGraphToBigWig {output.fc_signal_bdg} {config[chrom]} {output.fc_signal}


		chipReads=$(zcat {input.ip} | wc -l | awk '{{printf "%f", $1/1000000}}');
		controlReads=$(zcat {input.control} | wc -l | awk '{{printf "%f", $1/1000000}}');

		sval=$(echo "${{chipReads}} ${{controlReads}}" | awk '$1>$2{{printf "%f",$2}} $1<=$2{{printf "%f",$1}}');

		macs2 bdgcmp -t {input.treat} -c {input.control_lambda} --outdir peaks/signals -o {params.prefix}_ppois.bdg -m ppois -S ${{sval}}

		slopBed -i {output.ppois} -g {config[chrom]} -b 0 | \
		awk '{{if ($3 != -1) print $0}}' |  \
		bedClip stdin {config[chrom]} {output.pval_bdg}

		bedGraphToBigWig {output.pval_bdg} {config[chrom]} {output.pval}

		"""
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	#REP
	Rscript {config[path]}/run_spp.R -c={input.ip} -i={input.control} -p={threads} -npeak=300000 -odir=peaks_spp/{params.prefix} -speak=$fraglen -savr -savp -rf -out=peaks_spp/{params.prefix}/{params.prefix}.tagAlign.ccscores 

	zcat {output.narrowPeak_raw} | \
	awk 'BEGIN{{OFS="\t"}}{{ if ($2<0) $2=0; print $1,int($2),int($3),$4,$5,$6,$7,$8,$9,$10;}}' | \
	gzip -f -nc > {output.narrowPeak_unfiltered}

	bedtools intersect -v -a <(zcat -f {output.narrowPeak_unfiltered}) -b <(zcat -f {config[path]}/blacklist_hg19.bed.gz) | \
	awk 'BEGIN{{OFS="\t"}} {{if ($5>1000) $5=1000; print $0}}' | \
	grep -P 'chr[\dXY]+[ \t]' | \
	gzip -nc > {output.narrowPeak_final}


	#SELF-PSEUDOREP1
	Rscript {config[path]}/run_spp.R -c={input.ip_psr1} -i={input.control} -p={threads} -npeak=300000 -odir=peaks_spp/psr00_{params.prefix} -speak=$fraglen -savr -savp -rf -out=peaks_spp/psr00_{params.prefix}/psr_{params.prefix}.00.tagAlign.ccscores 

	zcat {output.narrowPeak_raw_0} | \
	awk 'BEGIN{{OFS="\t"}}{{ if ($2<0) $2=0; print $1,int($2),int($3),$4,$5,$6,$7,$8,$9,$10;}}' | \
	gzip -f -nc > {output.narrowPeak_unfiltered_0}

	bedtools intersect -v -a <(zcat -f {output.narrowPeak_unfiltered_0}) -b <(zcat -f {config[path]}/blacklist_hg19.bed.gz) | \
	awk 'BEGIN{{OFS="\t"}} {{if ($5>1000) $5=1000; print $0}}' | \
	grep -P 'chr[\dXY]+[ \t]' | \
	gzip -nc > {output.narrowPeak_final_0}

	#SELF-PSEUDOREP2
	Rscript {config[path]}/run_spp.R -c={input.ip_psr2} -i={input.control} -p={threads} -npeak=300000 -odir=peaks_spp/psr01_{params.prefix} -speak=$fraglen -savr -savp -rf -out=peaks_spp/psr01_{params.prefix}/psr_{params.prefix}.01.tagAlign.ccscores 

	zcat {output.narrowPeak_raw_1} | \
	awk 'BEGIN{{OFS="\t"}}{{ if ($2<0) $2=0; print $1,int($2),int($3),$4,$5,$6,$7,$8,$9,$10;}}' | \
	gzip -f -nc > {output.narrowPeak_unfiltered_1}

	bedtools intersect -v -a <(zcat -f {output.narrowPeak_unfiltered_1}) -b <(zcat -f {config[path]}/blacklist_hg19.bed.gz) | \
	awk 'BEGIN{{OFS="\t"}} {{if ($5>1000) $5=1000; print $0}}' | \
	grep -P 'chr[\dXY]+[ \t]' | \
	gzip -nc > {output.narrowPeak_final_1}
	"""
17
18
19
20
21
22
23
24
25
26
27
28
29
30
shell:
	"""
	idr --samples {input.narrowPeak_psr1} {input.narrowPeak_psr2} --peak-list {input.narrowPeak} --input-file-type narrowPeak --output-file idr/{params.prefix} --rank signal.value --soft-idr-threshold {params.threshold} --plot --use-best-multisummit-IDR

	IDR_THRESH_TRANSFORMED=`awk -v p={params.threshold} 'BEGIN{{print -log(p)/log(10)}}'`

	awk 'BEGIN{{OFS="\t"}} $12>='"$IDR_THRESH_TRANSFORMED"' {{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}}' idr/{params.prefix}.overlapped-peaks.txt | sort | uniq | sort -k7n,7n | gzip -nc > {output.idr_raw}

	NPEAKS_IDR=`zcat {output.idr_raw} | wc -l`

	bedtools intersect -v -a {output.idr_raw} -b {config[path]}/blacklist_hg19.bed.gz | grep -P 'chr[\dXY]+[ \t]' | awk 'BEGIN{{OFS="\t"}} {{if ($5>1000) $5=1000; print $0}}| gzip -nc > {output.idr_final}

	sleep 120
	"""
14
15
16
17
18
19
20
21
22
	shell:
		"""
		bedtools intersect \
                     -a {input.narrowPeak} -b {input.narrowPeak_psr1} -f 0.50 -F 0.50 -e -u|\
                     bedtools intersect \
                     -a stdin -b {input.narrowPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.narrow}

		sleep 120
		"""
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
	shell:
		"""
		bedtools intersect \
                     -a {input.broadPeak} -b {input.broadPeak_psr1} -f 0.50 -F 0.50 -e -u|\
                     bedtools intersect \
                     -a stdin -b {input.broadPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.broad}


		bedtools intersect \
                     -a {input.gappedPeak} -b {input.gappedPeak_psr1} -f 0.50 -F 0.50 -e -u|\
                     bedtools intersect \
                     -a stdin -b {input.gappedPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.gapped}

		sleep 120
		"""
ShowHide 68 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 10mo ago
Updated: 10mo ago
Maitainers: public
URL: https://github.com/weng-lab/psychip_snakemake
Name: psychip_snakemake
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v3.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...