snakemake workflow to run cellranger on a given bucket using gke.

public public 8mo ago 0 bookmarks

Snakemake workflow: Cellranger Snakemake GKE

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine.

Usage

The usage of this workflow is described in the

Code Snippets

 9
10
shell:
	"bwa mem -M -t {threads} {config[genome]} {input} | gzip -c > {output}"
11
12
shell:
	"bwa mem -M -t {threads} {config[genome]} {input} | gzip -c > {output}"
 7
 8
 9
10
11
12
13
14
15
16
17
	shell:
		"""
		zcat {input} | \
                awk 'BEGIN{{FS="\t"; OFS="\t"}}
                !/^@/ && $6!="*" {{
                cigar=$6; gsub("[0-9]+D","",cigar);
                n=split(cigar, vals, "[A-Z]");
                s=0; for(i=1;i<=n;i++) s=s+vals[i];
		seqlen=length($10);
		if(s!=seqlen) print $1"\t";}}' | sort | uniq > {output}
		"""
From line 7 of modules/1_filter_paired
27
28
29
30
31
32
33
34
35
run:
	with open("mapped_reads/" + wildcards.sample + ".badcigar") as fp:
		peek = [x for i,x in enumerate(fp) if i<10 and x.rstrip()]
	cigar_len=len(peek)
	shell("""if [ {cigar_len} -gt 0 ]
	then
		zcat {input.sam} | grep -vF -f {input.cigar} | \
		samtools view -@ {threads} -Su - | \
		samtools sort -@ {threads} -T {params.prefix} -o {output} -
53
54
55
56
57
58
59
60
61
62
63
	shell:
		"""
			samtools view -@ {threads} -F1804 -f2 -q 20 -h \
                        -u {input} | \
                        samtools sort -@ {threads} -m {params.mem} -T {params.prefix} -n -o {output.tmp} -

			samtools fixmate -r {output.tmp} {output.clean}

			samtools view -@ {threads} -F1804 -f2 -u {output.clean} |\
                        samtools sort -@ {threads} -m {params.mem} -T {params.prefix}_2 -o {output.flt} -
		"""
74
75
76
77
78
79
80
81
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.temp} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
 93
 94
 95
 96
 97
 98
 99
100
101
102
		shell:
			"""
			samtools view -@ {threads} -h {input} |\
                        grep -vF -f {config[sponges]} - |\
                        samtools view -@ {threads} -b -o {output.flt} -

			samtools sort -@ {threads} -m {params.mem} -T {params.prefix} -n -o {output.sponge_temp} {output.flt}
			samtools fixmate -r {output.sponge_temp} {output.flt}

			"""
116
117
118
119
120
121
122
123
124
125
		shell:
                        """
			samtools view -@ {threads} -F1804 -f2 -b -u {input} |\
                        samtools sort -@ {threads} -m {params.mem} -T {params.prefix}_2 -o {output.final_bam} -

                        #samtools view -@ {threads} -F1804 -f2 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -m {params.mem} -T {params.prefix} -o {output.tmp_sort} -n {input}
                        """	
136
137
138
139
140
141
142
143
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.flt} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
155
156
157
158
159
160
		shell:
                        """
                        samtools view -@ {threads} -F1804 -f2 -b -T {params.prefix} -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

                        """	
173
174
175
176
177
178
	shell:
		"""
		samtools sort -@ {threads} -T {params.prefix} -o {output.tmp_sort} -n {input.final_bam}
		bedtools bamtobed -bedpe -mate1 -i {output.tmp_sort} |\
                gzip -c > {output.bedpe} 
		"""
187
188
189
190
191
192
193
194
	shell:
		"""zcat {input} | \
                awk 'BEGIN{{OFS="\t"; FS="\t"}}
                {{ chrom=$1; beg=$2; end=$6;
                if($2>$5){{beg=$5}} if($3>$6){{end=$3}}
                print chrom,beg,end
                }}' - | {config[sort]} --parallel={threads} -S 2G -k1,1 -k2,2n | \
                gzip -c > {output.bed}"""
From line 187 of modules/1_filter_paired
 8
 9
10
11
12
	shell:
		"""
            	samtools view -@ {threads} -Su {input.sam} |\
                samtools sort -@ {threads} -T {params.prefix} -o {output} -
		"""
21
22
23
24
25
	shell:
		"""
			samtools view -@ {threads} -F 1804 -q 20 \
                        -u {input} -o {output.flt}
		"""
36
37
38
39
40
41
42
43
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.temp} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
51
52
53
54
55
56
57
		shell:
			"""
			samtools view -@ {threads} -h {input} |\
                        grep -vF -f {config[sponges]} - |\
                        samtools view -@ {threads} -b -o {output.flt} -

			"""
70
71
72
73
74
75
76
		shell:
                        """
                        samtools view -@ {threads} -F1804 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -n -T {params.prefix} -o {output.tmp_sort} {input}
                        """	
87
88
89
90
91
92
93
94
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.flt} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
107
108
109
110
111
112
113
		shell:
                        """
                        samtools view -@ {threads} -F1804 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -n -T {params.prefix} -o {output.tmp_sort} {input}
                        """	
122
123
124
125
126
127
	shell:
		"""
		bedtools bamtobed -i {input} | cut -f1-3 | \
		{config[sort]} --parallel={threads} -k1,1 -k2,2n | \
                gzip -c > {output.bed} 
		"""
9
shell:"samtools flagstat {input.final} > {output.final_mapstats}"
18
shell:"samtools flagstat {input.raw} > {output.raw_mapstats}"
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
	shell:
		"""
		samtools sort -n -@ {threads} -T {params.prefix} -o {output.tmp_sort_bam} {input.flt}
		echo 1 | \
                     awk '{{print "#Total\tDistinct\tOne\tTwo\tNRF\tPBC1\tPBC2"}}' \
                     > {output.lib_complexity}

		bedtools bamtobed -bedpe -i {output.tmp_sort_bam} | \
                     awk 'BEGIN{{OFS="\t"}}{{print $1,$2,$4,$6,$9,$10}}' | \
                     grep -v 'chrM' | sort | uniq -c | \
                     awk 'BEGIN{{mt=0;m0=0;m1=0;m2=0; OFS="\t"}}
                               ($1==1){{m1=m1+1}} ($1==2){{m2=m2+1}} {{m0=m0+1}} {{mt=mt+$1}}
                          END{{print mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}}'\
                     >> {output.lib_complexity}
		"""
9
shell: "samtools flagstat {input.final} > {output.final_mapstats}"
18
shell: "samtools flagstat {input.raw} > {output.raw_mapstats}"
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
	shell:
		"""
		samtools sort -n -@ {threads} -T {params.prefix} -o {output.tmp_sort_bam} {input.flt}
		echo 1 | \
                     awk '{{print "#Total\tDistinct\tOne\tTwo\tNRF\tPBC1\tPBC2"}}' \
                     > {output.lib_complexity}

		bedtools bamtobed -i {output.tmp_sort_bam} | \
                     awk 'BEGIN{{OFS="\t"}}{{print $1,$2,$3,$6}}' | \
                     grep -v 'chrM' | sort | uniq -c | \
                     awk 'BEGIN{{mt=0;m0=0;m1=0;m2=0; OFS="\t"}}
                               ($1==1){{m1=m1+1}} ($1==2){{m2=m2+1}} {{m0=m0+1}} {{mt=mt+$1}}
                          END{{if(m2>0) {{print mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}}else{{print mt,m0,m1,m2,m0/mt,m1/m0,"-INF"}}}}'\
                     >> {output.lib_complexity}
		"""
11
12
13
14
15
16
shell:
	"""
	bamToBed -i {input} | awk 'BEGIN{{OFS="\\t"}}{{$4="N";$5="1000";print $0}}' |\
	tee {output.temp_tagAlign} |\
	gzip -c > {output.tagAlign}
	"""
From line 11 of modules/3_xcor_paired
27
28
29
30
31
32
33
34
35
	shell:
		"""
		grep -v 'chrM' {input.temp_tagAlign} |\
		shuf -n 15000000  |awk 'BEGIN{{OFS="\\t"}}{{$4="N";$5="1000";print $0}}'  > {output.subsample} 

		Rscript {config[path]}/run_spp_nodups.R -c={output.subsample} -p={threads} -filtchr=chrM -savp={output.cc_plot} -out={output.cc_score} > /dev/null 2>&1
        	sed -r  's/,[^\\t]+//g' {output.cc_score} > {output.temp_cc_score} 
		cp {output.temp_cc_score} {output.cc_score}		
		"""
From line 27 of modules/3_xcor_paired
11
12
13
14
15
16
shell:
	"""
	bamToBed -i {input} | awk 'BEGIN{{OFS="\\t"}}{{$4="N";$5="1000";print $0}}' |\
	tee {output.temp_tagAlign} |\
	gzip -c > {output.tagAlign}
	"""
From line 11 of modules/3_xcor_single
27
28
29
30
31
32
33
34
35
	shell:
		"""
		grep -v 'chrM' {input.temp_tagAlign} |\
		shuf -n 15000000  > {output.subsample} 

		Rscript {config[path]}/run_spp_nodups.R -c={output.subsample} -p={threads} -filtchr=chrM -savp={output.cc_plot} -out={output.cc_score} > /dev/null 2>&1
        	sed -r  's/,[^\\t]+//g' {output.cc_score} > {output.temp_cc_score} 
		cp {output.temp_cc_score} {output.cc_score}		
		"""
From line 27 of modules/3_xcor_single
13
14
15
16
17
18
19
20
21
22
23
	shell:
		"""
		#If the file is too big '--random-source' option will fail. That's why the '40000000'
		zcat {input} | shuf --random-source=<(zcat {input} | head -40000000) > {output.shuf}
		{config[split]} -d -nl/2 --additional-suffix=\".bed\" {output.shuf} bed_files/psr_{wildcards.sample}.

		{config[sort]} --parallel={threads} -S 2G \
                     -k1,1 -k2,2n {output.temp_psr1} | gzip -c > {output.psr1} 
		{config[sort]} --parallel={threads} -S 2G \
                     -k1,1 -k2,2n {output.temp_psr2} | gzip -c > {output.psr2} 
		"""
From line 13 of modules/4_pseudoreps
32
33
34
35
shell:
	"""
	zcat {input} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output}
	"""
From line 32 of modules/4_pseudoreps
47
48
49
50
51
shell:
	"""
	zcat {input.input1} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output.rep1}
	zcat {input.input2} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output.rep2}
	"""
From line 47 of modules/4_pseudoreps
59
60
61
62
shell:
	"""
	zcat {input} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output}
	"""
From line 59 of modules/4_pseudoreps
32
33
34
35
36
37
38
run:
	import os, sys, time,subprocess
	#Calling narrowPeaks	
	shell("""
	fraglen=`cat {input.cc_scores}| cut -f3`

	{config[macs2]} callpeak \
409
410
411
412
413
414
415
416
417
418
419
420
421
422
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
589
590
591
592
593
594
595
596
597
598
599
600
601
602
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
769
770
771
772
773
774
775
776
777
778
779
780
781
782
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
33
34
35
36
37
38
39
run:
	import os, sys, time,subprocess
	#Calling narrowPeaks	
	shell("""
	fraglen=`cat {input.cc_scores}| cut -f3`

	{config[macs2]} callpeak \
412
413
414
415
416
417
418
419
420
421
422
423
424
425
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
32
33
34
35
36
37
38
run:
	import os, sys, time,subprocess
	#Calling narrowPeaks	
	shell("""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
435
436
437
438
439
440
441
442
443
444
445
446
447
448
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
34
35
36
37
38
39
40
run:
	import os, sys, time,subprocess
	#Calling narrowPeaks	
	shell("""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
413
414
415
416
417
418
419
420
421
422
423
424
425
426
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	#Calling narrowPeaks	
	shell("""
	{config[macs2]} callpeak \
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
	shell:
		"""
		bedtools intersect \
                     -a {input.narrowPeak} -b {input.narrowPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.narrowPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.narrow}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_narrow_bb {output.narrow}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/narrowPeak.as finalPeaks/tmp_{params.prefix}_narrow_bb {config[chrom]} {output.narrowPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_narrow_bb

		bedtools intersect \
                     -a {input.broadPeak} -b {input.broadPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.broadPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.broad}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_broad_bb {output.broad}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/broadPeak.as finalPeaks/tmp_{params.prefix}_broad_bb {config[chrom]} {output.broadPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_broad_bb

		bedtools intersect \
                     -a {input.gappedPeak} -b {input.gappedPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.gappedPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.gapped}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_gapped_bb {output.gapped}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/gappedPeak.as finalPeaks/tmp_{params.prefix}_gapped_bb {config[chrom]} {output.gappedPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_gapped_bb

		sleep 120
		"""
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
	shell:
		"""
		bedtools intersect \
                     -a {input.narrowPeak} -b {input.narrowPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.narrowPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.narrow}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_narrow_bb {output.narrow}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/narrowPeak.as finalPeaks/tmp_{params.prefix}_narrow_bb {config[chrom]} {output.narrowPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_narrow_bb

		bedtools intersect \
                     -a {input.broadPeak} -b {input.broadPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.broadPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.broad}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_broad_bb {output.broad}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/broadPeak.as finalPeaks/tmp_{params.prefix}_broad_bb {config[chrom]} {output.broadPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_broad_bb

		bedtools intersect \
                     -a {input.gappedPeak} -b {input.gappedPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.gappedPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.gapped}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_gapped_bb {output.gapped}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/gappedPeak.as finalPeaks/tmp_{params.prefix}_gapped_bb {config[chrom]} {output.gappedPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_gapped_bb

		sleep 120
		"""
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
	shell:
		"""
		bedtools intersect \
                     -a {input.narrowPeak} -b {input.narrowPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.narrowPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.narrow}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_narrow_bb {output.narrow}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/narrowPeak.as finalPeaks/tmp_{params.prefix}_narrow_bb {config[chrom]} {output.narrowPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_narrow_bb

		bedtools intersect \
                     -a {input.broadPeak} -b {input.broadPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.broadPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.broad}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_broad_bb {output.broad}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/broadPeak.as finalPeaks/tmp_{params.prefix}_broad_bb {config[chrom]} {output.broadPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_broad_bb

		bedtools intersect \
                     -a {input.gappedPeak} -b {input.gappedPeak_psr1} -f 0.50 -F 0.50 -e -u |\
                     bedtools intersect \
                     -a stdin -b {input.gappedPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.gapped}

		sort -k1,1 -k2,2n -o finalPeaks/tmp_{params.prefix}_gapped_bb {output.gapped}
		{config[path]}/bedToBigBed -type='bed6+4' -as={config[path]}/gappedPeak.as finalPeaks/tmp_{params.prefix}_gapped_bb {config[chrom]} {output.gappedPeak_bb}
		rm -f finalPeaks/tmp_{params.prefix}_gapped_bb

		sleep 120
		"""
11
12
13
14
15
16
shell:
	"""
	bwa mem -M -t {threads} {config[genome]} {input} | gzip -c > {output.sam}

	samtools view -@ {threads} -Su {output.sam} | samtools sort -@ {threads} -T {params.prefix} -o  {output.bam}
	"""
 9
10
11
12
13
	shell:
		"""
			samtools view -@ {threads} -F 1804 -q 20 \
                        -b {input} -o {output.flt}
		"""
24
25
26
27
28
29
30
31
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.temp} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
39
40
41
42
43
44
45
		shell:
			"""
			samtools view -@ {threads} -h {input} |\
                        grep -vF -f {config[sponges]} - |\
                        samtools view -@ {threads} -b -o {output.flt} -

			"""
58
59
60
61
62
63
64
		shell:
                        """
                        samtools view -@ {threads} -F1804 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -n -T {params.prefix} -o {output.tmp_sort} {input}
                        """	
75
76
77
78
79
80
81
82
shell:
	"""
	java -Xmx4G -jar {config[path]}/picard-tools-1.141/picard.jar MarkDuplicates\
	INPUT={input} OUTPUT={output.flt} \
	METRICS_FILE={output.qc} ASSUME_SORTED=true \
	VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=false

	"""
 95
 96
 97
 98
 99
100
101
		shell:
                        """
                        samtools view -@ {threads} -F1804 -b -o {output.final_bam} {input}
                        samtools index {output.final_bam} {output.final_bai}

			samtools sort -@ {threads} -n -T {params.prefix} -o {output.tmp_sort} {input}
                        """	
110
111
112
113
114
115
116
shell:
	"""
	bedtools bamtobed -i {input} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="N";$5="1000";print $0}}' | \
	{config[sort]} --parallel={threads} -k1,1 -k2,2n | \
	gzip -nc > {output.tagAlign} 
	"""
9
shell: "samtools flagstat {input.final} > {output.final_mapstats}"
SAMtools From line 9 of TF/2_qc_single
18
shell: "samtools flagstat {input.raw} > {output.raw_mapstats}"
SAMtools From line 18 of TF/2_qc_single
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
	shell:
		"""
		samtools sort -n -@ {threads} -T {params.prefix} -o {output.tmp_sort_bam} {input.flt}
		echo 1 | \
                     awk '{{print "#TotalReadPairs\tDistinctReadPairs\tOneReadPair\tTwoReadPairs\tNRF=Distinct/Total\tPBC1=OnePair/Distinct\tPBC2=OnePair/TwoPair"}}' \
                     > {output.lib_complexity}

		bedtools bamtobed -i {output.tmp_sort_bam} | \
                     awk 'BEGIN{{OFS="\t"}}{{print $1,$2,$3,$6}}' | \
                     grep -v 'chrM' | sort | uniq -c | \
                     awk 'BEGIN{{mt=0;m0=0;m1=0;m2=0; OFS="\t"}}
                               ($1==1){{m1=m1+1}} ($1==2){{m2=m2+1}} {{m0=m0+1}} {{mt=mt+$1}}
                          END{{if(m2>0) {{print mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}}else{{print mt,m0,m1,m2,m0/mt,m1/m0,"-INF"}}}}'\
                     >> {output.lib_complexity}
		"""
10
11
12
13
14
15
16
17
18
19
20
21
	shell:
		"""
		NREADS=15000000

		zcat {input.tagAlign} | grep -v 'chrM' |\
		shuf -n $NREADS | gzip -nc > {output.subsample} 

		Rscript {config[path]}/run_spp_nodups.R -c={output.subsample} -p={threads} -filtchr=chrM -savp={output.cc_plot} -out={output.cc_score} > /dev/null 2>&1
        	sed -r  's/,[^\\t]+//g' {output.cc_score} > {output.cc_score}.tmp 
		mv {output.cc_score}.tmp {output.cc_score}		
		sleep 120
		"""
From line 10 of TF/3_xcor_single
13
14
15
16
17
18
19
20
21
22
23
	shell:
		"""
		#If the file is too big '--random-source' option will fail. That's why the '40000000'
		zcat {input} | shuf --random-source=<(zcat {input} | head -40000000) > {output.shuf}
		{config[split]} -d -nl/2 --additional-suffix=\".tagAlign\" {output.shuf} bed_files/psr_{wildcards.sample}.

		{config[sort]} --parallel={threads} -S 2G \
                     -k1,1 -k2,2n {output.temp_psr1} | gzip -c > {output.psr1}
		{config[sort]} --parallel={threads} -S 2G \
                     -k1,1 -k2,2n {output.temp_psr2} | gzip -c > {output.psr2}
		"""
From line 13 of TF/4_pseudoreps
32
33
34
35
shell:
	"""
	zcat {input} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output}
	"""
From line 32 of TF/4_pseudoreps
43
44
45
46
shell:
	"""
	zcat {input} | {config[sort]} --parallel={threads} -k1,1 -k2,2n | gzip -c > {output}
	"""
From line 43 of TF/4_pseudoreps
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
	-t {input.ip} -c {input.control}\
	-f BED -n peaks/{params.prefix}\
	-g {config[macs_g]} -p 1e-2 --nomodel --shift 0 --extsize $fraglen \
	--keep-dup all -B --SPMR

	sort -k 8gr,8gr {output.narrowPeak} |\
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.narrowPeak_compressed}

	rm -f peaks/{params.prefix}_peaks.xls peaks/{params.prefix}_summits.bed
	touch {output.checkpoint}
	"""
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
	-t {input.ip_psr1} -c {input.control}\
	-f BED -n peaks/{params.prefix}\
	-g {config[macs_g]} -p 1e-2 --nomodel --shift 0 --extsize $fraglen \
	--keep-dup all -B --SPMR

	sort -k 8gr,8gr {output.narrowPeak_psr1} |\
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.narrowPeak_psr1_compressed}

	rm -f peaks/{params.prefix}_peaks.xls peaks/{params.prefix}_summits.bed

	{config[macs2]} callpeak \
	-t {input.ip_psr2} -c {input.control}\
	-f BED -n peaks/{params.prefix_2}\
	-g {config[macs_g]} -p 1e-2 --nomodel --shift 0 --extsize $fraglen \
	 --keep-dup all -B --SPMR

	sort -k 8gr,8gr {output.narrowPeak_psr2} |\
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.narrowPeak_psr2_compressed}

	rm -f peaks/{params.prefix_2}_peaks.xls peaks/{params.prefix_2}_summits.bed
	touch {output.checkpoint}
	"""
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
	-t {input.ip} -c {input.control}\
	-f BED -n peaks/{params.prefix}\
	-g {config[macs_g]} -p 1e-2 --broad --nomodel --shift 0 --extsize $fraglen \
	--keep-dup all

	sort -k 8gr,8gr {output.broadPeak} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.broadPeak_compressed}

	sort -k 14gr,14gr {output.broadPeak} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.gappedPeak_compressed}


	rm -f peaks/{params.prefix}_peaks.xls peaks/{params.prefix}_summits.bed

	"""
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
shell:
	"""
	{config[macs2]} bdgcmp  \
	-t {input.treat} -c {input.control_lambda} \
	--outdir peaks/signals -o {params.prefix}_FE.bdg -m FE

	bedtools slop -i {output.fe_bdg} -g {config[chrom]} -b 0 |\
	awk '{{ if($3!=-1) print $0}}' | \
	{config[path]}/bedClip stdin {config[chrom]} {output.fc_signal_bdg}


	{config[path]}/bedGraphToBigWig {output.fc_signal_bdg} {config[chrom]} {output.fc_signal}


	chipReads=$(zcat {input.ip} | wc -l | awk '{{printf "%f", $1/1000000}}');
	controlReads=$(zcat {input.control} | wc -l | awk '{{printf "%f", $1/1000000}}');

	sval=$(echo "${{chipReads}} ${{controlReads}}" | awk '$1>$2{{printf "%f",$2}} $1<=$2{{printf "%f",$1}}');

	macs2 bdgcmp -t {input.treat} -c {input.control_lambda} --outdir peaks/signals -o {params.prefix}_ppois.bdg -m ppois -S ${{sval}}

	slopBed -i {output.ppois} -g {config[chrom]} -b 0 | \
	awk '{{if ($3 != -1) print $0}}' |  \
	bedClip stdin {config[chrom]} {output.pval_bdg}

	bedGraphToBigWig {output.pval_bdg} {config[chrom]} {output.pval}

	"""
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	{config[macs2]} callpeak \
	  -t {input.ip_psr1} -c {input.control}\
	  -f BED -n peaks/{params.prefix}\
	  -g {config[macs_g]} -p 1e-2 --broad --nomodel --shift 0 --extsize $fraglen \
	  --keep-dup all

	sort -k 8gr,8gr {output.broadPeak_psr1} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.broadPeak_psr1_compressed}

	sort -k 14gr,14gr {output.broadPeak_psr1} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.gappedPeak_psr1_compressed}


	rm -f peaks/{params.prefix}_peaks.xls peaks/{params.prefix}_summits.bed


	{config[macs2]} callpeak \
	-t {input.ip_psr2} -c {input.control}\
	-f BED -n peaks/{params.prefix_2}\
	-g {config[macs_g]} -p 1e-2 --broad --nomodel --shift 0 --extsize $fraglen \
	--keep-dup all

	sort -k 8gr,8gr {output.broadPeak_psr2} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.broadPeak_psr2_compressed}

	sort -k 14gr,14gr {output.broadPeak_psr2} | \
	awk 'BEGIN{{OFS="\t"}}{{$4="Peak_"NR ; print $0}}'| \
	gzip -nc > {output.gappedPeak_psr2_compressed}


	rm -f peaks/{params.prefix_2}_peaks.xls peaks/{params.prefix_2}_summits.bed
	"""
253
254
255
256
257
258
259
260
261
262
263
264
265
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	shell("""
	{config[macs2]} callpeak \
295
296
297
298
299
300
301
302
303
304
305
306
307
308
run:
	import os, sys, time,subprocess
	sum_len = 0
	n_files = 0
	for files in {input.cc_scores}:
		for file in files:
			n_files+=1
			with open(file,'r') as f:
				sum_len+=int(f.readlines()[0].split("\t")[2])
	fraglen=int(sum_len/n_files)

	shell("""

	{config[macs2]} callpeak \
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
	shell:
		"""
		{config[macs2]} bdgcmp  \
                 -t {input.treat} -c {input.control_lambda} \
                 --outdir peaks/signals -o {params.prefix}_FE.bdg -m FE

		bedtools slop -i {output.fe_bdg} -g {config[chrom]} -b 0 |\
		awk '{{ if($3!=-1) print $0}}' | \
                {config[path]}/bedClip stdin {config[chrom]} {output.fc_signal_bdg}


		{config[path]}/bedGraphToBigWig {output.fc_signal_bdg} {config[chrom]} {output.fc_signal}


		chipReads=$(zcat {input.ip} | wc -l | awk '{{printf "%f", $1/1000000}}');
		controlReads=$(zcat {input.control} | wc -l | awk '{{printf "%f", $1/1000000}}');

		sval=$(echo "${{chipReads}} ${{controlReads}}" | awk '$1>$2{{printf "%f",$2}} $1<=$2{{printf "%f",$1}}');

		macs2 bdgcmp -t {input.treat} -c {input.control_lambda} --outdir peaks/signals -o {params.prefix}_ppois.bdg -m ppois -S ${{sval}}

		slopBed -i {output.ppois} -g {config[chrom]} -b 0 | \
		awk '{{if ($3 != -1) print $0}}' |  \
		bedClip stdin {config[chrom]} {output.pval_bdg}

		bedGraphToBigWig {output.pval_bdg} {config[chrom]} {output.pval}

		"""
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
shell:
	"""
	fraglen=`cat {input.cc_scores} | cut -f3`

	#REP
	Rscript {config[path]}/run_spp.R -c={input.ip} -i={input.control} -p={threads} -npeak=300000 -odir=peaks_spp/{params.prefix} -speak=$fraglen -savr -savp -rf -out=peaks_spp/{params.prefix}/{params.prefix}.tagAlign.ccscores 

	zcat {output.narrowPeak_raw} | \
	awk 'BEGIN{{OFS="\t"}}{{ if ($2<0) $2=0; print $1,int($2),int($3),$4,$5,$6,$7,$8,$9,$10;}}' | \
	gzip -f -nc > {output.narrowPeak_unfiltered}

	bedtools intersect -v -a <(zcat -f {output.narrowPeak_unfiltered}) -b <(zcat -f {config[path]}/blacklist_hg19.bed.gz) | \
	awk 'BEGIN{{OFS="\t"}} {{if ($5>1000) $5=1000; print $0}}' | \
	grep -P 'chr[\dXY]+[ \t]' | \
	gzip -nc > {output.narrowPeak_final}


	#SELF-PSEUDOREP1
	Rscript {config[path]}/run_spp.R -c={input.ip_psr1} -i={input.control} -p={threads} -npeak=300000 -odir=peaks_spp/psr00_{params.prefix} -speak=$fraglen -savr -savp -rf -out=peaks_spp/psr00_{params.prefix}/psr_{params.prefix}.00.tagAlign.ccscores 

	zcat {output.narrowPeak_raw_0} | \
	awk 'BEGIN{{OFS="\t"}}{{ if ($2<0) $2=0; print $1,int($2),int($3),$4,$5,$6,$7,$8,$9,$10;}}' | \
	gzip -f -nc > {output.narrowPeak_unfiltered_0}

	bedtools intersect -v -a <(zcat -f {output.narrowPeak_unfiltered_0}) -b <(zcat -f {config[path]}/blacklist_hg19.bed.gz) | \
	awk 'BEGIN{{OFS="\t"}} {{if ($5>1000) $5=1000; print $0}}' | \
	grep -P 'chr[\dXY]+[ \t]' | \
	gzip -nc > {output.narrowPeak_final_0}

	#SELF-PSEUDOREP2
	Rscript {config[path]}/run_spp.R -c={input.ip_psr2} -i={input.control} -p={threads} -npeak=300000 -odir=peaks_spp/psr01_{params.prefix} -speak=$fraglen -savr -savp -rf -out=peaks_spp/psr01_{params.prefix}/psr_{params.prefix}.01.tagAlign.ccscores 

	zcat {output.narrowPeak_raw_1} | \
	awk 'BEGIN{{OFS="\t"}}{{ if ($2<0) $2=0; print $1,int($2),int($3),$4,$5,$6,$7,$8,$9,$10;}}' | \
	gzip -f -nc > {output.narrowPeak_unfiltered_1}

	bedtools intersect -v -a <(zcat -f {output.narrowPeak_unfiltered_1}) -b <(zcat -f {config[path]}/blacklist_hg19.bed.gz) | \
	awk 'BEGIN{{OFS="\t"}} {{if ($5>1000) $5=1000; print $0}}' | \
	grep -P 'chr[\dXY]+[ \t]' | \
	gzip -nc > {output.narrowPeak_final_1}
	"""
17
18
19
20
21
22
23
24
25
26
27
28
29
30
shell:
	"""
	idr --samples {input.narrowPeak_psr1} {input.narrowPeak_psr2} --peak-list {input.narrowPeak} --input-file-type narrowPeak --output-file idr/{params.prefix} --rank signal.value --soft-idr-threshold {params.threshold} --plot --use-best-multisummit-IDR

	IDR_THRESH_TRANSFORMED=`awk -v p={params.threshold} 'BEGIN{{print -log(p)/log(10)}}'`

	awk 'BEGIN{{OFS="\t"}} $12>='"$IDR_THRESH_TRANSFORMED"' {{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}}' idr/{params.prefix}.overlapped-peaks.txt | sort | uniq | sort -k7n,7n | gzip -nc > {output.idr_raw}

	NPEAKS_IDR=`zcat {output.idr_raw} | wc -l`

	bedtools intersect -v -a {output.idr_raw} -b {config[path]}/blacklist_hg19.bed.gz | grep -P 'chr[\dXY]+[ \t]' | awk 'BEGIN{{OFS="\t"}} {{if ($5>1000) $5=1000; print $0}}| gzip -nc > {output.idr_final}

	sleep 120
	"""
BEDTools From line 17 of TF/6_idr
14
15
16
17
18
19
20
21
22
	shell:
		"""
		bedtools intersect \
                     -a {input.narrowPeak} -b {input.narrowPeak_psr1} -f 0.50 -F 0.50 -e -u|\
                     bedtools intersect \
                     -a stdin -b {input.narrowPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.narrow}

		sleep 120
		"""
BEDTools From line 14 of TF/6_overlap
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
	shell:
		"""
		bedtools intersect \
                     -a {input.broadPeak} -b {input.broadPeak_psr1} -f 0.50 -F 0.50 -e -u|\
                     bedtools intersect \
                     -a stdin -b {input.broadPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.broad}


		bedtools intersect \
                     -a {input.gappedPeak} -b {input.gappedPeak_psr1} -f 0.50 -F 0.50 -e -u|\
                     bedtools intersect \
                     -a stdin -b {input.gappedPeak_psr2} -f 0.50 -F 0.50 -e -u > {output.gapped}

		sleep 120
		"""
BEDTools From line 41 of TF/6_overlap
ShowHide 68 more snippets with no or duplicated tags.

Free

Created: 8mo ago
Updated: 8mo ago
Maitainers: public
URL: https://github.com/amc-heme/cellranger-snakemake-gke
Name: cellranger-snakemake-gke
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows