Mega non-model WGS Snakeflow


mega-non-model-wgs-snakeflow

There are two main changes:

  1. The genomics databases are now created directly from the GVCF sections. As a consequence, the GVCFs for all chromosomes and scaffold groups no longer have to be complete before moving on to the genomics DB stage. This is helpful when some chromosomes are very long. Otherwise, this change makes no difference to the user’s responsibilities.

  2. The GenotypeGVCFs step is now done in a scattered fashion over intervals that are all shorter than some desired length. Instead of waiting a very long time for a long chromosome to finish this step, you can roughly equalize the time required for each instance of it by chopping each chromosome into, say, 5-megabase chunks, dispatching each of those jobs separately, and stitching the results back together at the end (see the sketch below). This also means the workflow can be designed so that none of these jobs exceeds 24 hours, allowing them to run on the standard queue on most clusters.
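
To make the chopping concrete, here is a minimal Python sketch of the idea (a hypothetical illustration, not the workflow's actual code; the workflow does this in R with the chromo_chopper() function in the sequence-scatter-bins.R script shown further below). It splits a chromosome of a given length into roughly equal, 1-based inclusive intervals, none longer than a chosen maximum:

import math

def chop_chromosome(length, max_chunk):
    """Split a chromosome of `length` bases into roughly equal,
    1-based inclusive (start, end) intervals, each <= max_chunk."""
    n_bins = math.ceil(length / max_chunk)   # number of scatter intervals
    size = math.ceil(length / n_bins)        # near-equal size of each interval
    intervals = []
    start = 1
    while start <= length:
        end = min(start + size - 1, length)  # final interval absorbs any remainder
        intervals.append((start, end))
        start = end + 1
    return intervals

# Example: a 23.7 Mb chromosome in ~5 Mb chunks
print(chop_chromosome(23_700_000, 5_000_000))
# -> 5 intervals of 4,740,000 bp each, the last ending at 23,700,000

Each such interval can then be handed to its own GenotypeGVCFs job (as in the scattered rules below), and the per-interval VCFs concatenated afterwards with bcftools concat.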

Condensed DAG for the workflow

Here is a DAG for the workflow on the test data in .test, condensed into an easier-to-read picture by the condense_dag() function in Eric’s SnakemakeDagR package.

[Condensed DAG image of the Snakemake workflow]

  • fastqc on both reads

  • don’t bother with single end

  • add adapter sequences so the Trimmomatic ILLUMINACLIP step can work

  • benchmark each rule

  • use GenomicsDBImport

  • allow for merging of many small scaffolds into a single genomics DB (see the sketch below)
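
As a rough sketch of that last point, the following hypothetical Python snippet (not the workflow's actual code; the workflow assigns scaffolds to scatter groups with the iteratively_assign_scats() function in the sequence-scatter-bins.R script below) greedily packs small scaffolds into groups whose total length stays under a chosen maximum:

def group_scaffolds(scaffold_lengths, max_bases):
    """Greedily assign scaffolds (name -> length) to groups whose
    summed length stays at or below max_bases whenever possible."""
    groups = []                  # each element: (list_of_names, total_length)
    names, total = [], 0
    for name, length in scaffold_lengths.items():
        if names and total + length > max_bases:
            groups.append((names, total))    # close off the current group
            names, total = [], 0
        names.append(name)
        total += length
    if names:
        groups.append((names, total))
    return groups

# Example: pack scaffolds into ~5 Mb groups
scaffs = {"scaf_1": 3_000_000, "scaf_2": 1_500_000, "scaf_3": 2_000_000, "scaf_4": 400_000}
print(group_scaffolds(scaffs, 5_000_000))
# -> [(['scaf_1', 'scaf_2'], 4500000), (['scaf_3', 'scaf_4'], 2400000)]

Each resulting scaffold group can then be treated much like a single chromosome when building the genomics databases.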

Code Snippets

shell:
	" bam clipOverlap --in {input} --out {output} --stats 2> {log.clip} && "
	" samtools index {output} 2> {log.index}"
shell:
    " for i in {params.sams}; do echo $i; done > {output.sample_list} 2>{log} "
shell:
    " IGRP={wildcards.igrp};                           "  
    " if [ $IGRP = \"__ALL\" ]; then                   "   # just hard-link the files in this case
    "    (ln  {input.indel} {output.vcf};              "
    "    ln  {input.indel_idx} {output.idx}) 2>{log};  "
    " else                                        "
    "   gatk SelectVariants -R {input.ref}        "
    "     -V {input.indel}                        "
    "     -O {output.vcf}                         "
    "     --sample-name {input.sample_list}       "
    "     --exclude-non-variants 2>{log};         "
    " fi                                          "
shell:
	"SAMP=$(bcftools query -l {input.indel} | head -n 1); "
	" bcftools view -Oz -s $SAMP {input.indel} > {output.vcf} 2> {log}; "
	" bcftools index -t {output.vcf} 2>> {log}; "
shell:
    " (bcftools concat {params.opts} -Oz {input} > {output.vcf} 2> {log}; "
    " bcftools index -t {output.vcf})  2>> {log}; "
shell:
    "( wget -P {output.dir} https://storage.googleapis.com/gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2 && "
    "  bzip2 -d  {output.dir}/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2  && "
    "  tar --directory {output.dir} -xvf {output.dir}/GenomeAnalysisTK-3.8-0-ge9d806836.tar && "
    "  mv {output.dir}/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar {output.dir} && "
    "  if [ $(uname -s) = \"Darwin\" ]; then gatk3-register {output.jar}; fi"
    "  ) > {log} 2>&1 "
shell:
    "gatk3 {params.jopts} -T RealignerTargetCreator -nt {threads} "
    " -R resources/genome.fasta "
    " -o {output} "
    " -known {input.vcf} 2> {log} "
shell:
    "gatk3 {params.jopts}  \"-Dsamjdk.compression_level=9\" "
    " -T IndelRealigner  "
    " -R {input.fasta} "
    " -I {input.bam} "
    " -targetIntervals {input.intervals} "
    " -known {input.vcf} "
    " --consensusDeterminationModel KNOWNS_ONLY "
    " -LOD 0.4 " # this is recommended at https://github.com/broadinstitute/gatk-docs/blob/master/gatk3-methods-and-algorithms/Local_Realignment_around_Indels.md
    " -o {output.bam} "
    " > {log} 2>&1 "
wrapper:
    "0.59.2/bio/vep/annotate"
shell:
	" git log | head -n 150  > {params.config_dir}/latest-git-commits.txt;  "
	" mkdir -p results/qc_summaries/bqsr-round-{{0..{params.BQR}}}; "
	" for i in {{0..{params.BQR}}}; do cp -r results/bqsr-round-$i/qc/{{multiqc.html,bcftools_stats/*.txt}} results/qc_summaries/bqsr-round-$i/; done; "
	" for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/qc.tar results/bqsr-round-$i/qc; gzip results/bqsr-round-$i/qc.tar;  done; "
	" for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/benchmarks.tar results/bqsr-round-$i/benchmarks; gzip results/bqsr-round-$i/benchmarks.tar; done; "
	" for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/logs.tar results/bqsr-round-$i/logs; gzip results/bqsr-round-$i/logs.tar; done; "
	"                                                                                                                  "
	" rclone copy --dry-run  --bwlimit 8650k . {params.rclone_base}  "
	" --include='{params.config_dir}/**' "
	" --include='results/qc_summaries/**' "
	" --include='results/bqsr-round-{{{params.comma_nums}}}/{{qc,benchmarks,logs}}.tar.gz' "
	" --include='results/bqsr-round-{{{params.comma_nums}}}/{{bcf,bq_recal_tables,bq_variants}}/**' "
	" --include='resources/**' "
	" {params.data_comm} "
	" --include='results/bqsr-round-{params.BQR}/gvcf/*' "
	" --include='results/{params.bamdir}/*' "
	" --include='results/bqsr-round-{params.BQR}/indel_realigned/**' "
shell:
	" git log | head -n 150  > config/latest-git-commits.txt;  "
	" mkdir -p results/qc_summaries/bqsr-round-{{0..{params.BQR}}}; "
	" for i in {{0..{params.BQR}}}; do cp -r results/bqsr-round-$i/qc/{{multiqc.html,bcftools_stats/*.txt}} results/qc_summaries/bqsr-round-$i/; done; "
	" for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/qc.tar results/bqsr-round-$i/qc; gzip results/bqsr-round-$i/qc.tar;  done; "
	" for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/benchmarks.tar results/bqsr-round-$i/benchmarks; gzip results/bqsr-round-$i/benchmarks.tar; done; "
	" for i in {{0..{params.BQR}}}; do tar -cvf results/bqsr-round-$i/logs.tar results/bqsr-round-$i/logs; gzip results/bqsr-round-$i/logs.tar; done; "
	"                                                                                                                  "
	" rclone copy --dry-run  --drive-stop-on-upload-limit . {params.rclone_base}  "
	" --include='config/**' "
	" --include='results/qc_summaries/**' "
	" --include='results/bqsr-round-{params.BQR}/{{qc,benchmarks,logs}}.tar.gz' "
	" --include='results/bqsr-round-{params.BQR}/bcf/**' "
	" --include='resources/**' "
	" --include='data/**' "
	" --include='results/bqsr-round-{params.BQR}/gvcf/*' "
	" --include='results/{params.bamdir}/*' "
	" --include='results/bqsr-round-{params.BQR}/overlap_clipped/*' "
shell:
	"workflow/scripts/qd_and_qual.sh {input.bcf} {output.qd} {output.qual} {log} "
Snakemake: from line 19 of rules/bqsr.smk
shell:
	"SAMP=$(bcftools query -l {input.bcf} | head -n 1); "
	" bcftools view -i 'QUAL >= {params.qual} && QD >= {params.qd}' "
	"     -Oz -s $SAMP {input.bcf} > {output.vcf} 2> {log}; "
	" bcftools index -t {output.vcf} 2>> {log}; "
	" bcftools stats {output.vcf} > {output.stats} 2>> {log}; "
shell:
	"gatk BaseRecalibrator "
	" -I {input.bam} "
	" -R {input.ref} "
	" --known-sites {input.vcf} "
	" --bqsr-baq-gap-open-penalty 30 "
	" -O {output}  2> {log} "
shell:
	"gatk --java-options \"-Dsamjdk.compression_level=9\" ApplyBQSR "
	" -R {input.ref} "
	" -I {input.bam} "
	" --bqsr-recal-file {input.rtable} "
	" -O {output.bam}  2> {log} "
shell:
    " awk -v sg={wildcards.scaff_group} 'NR>1 && $1 == sg {{print $2}}' {input.scaff_groups} > {output} 2> {log};"
shell:
    " echo {wildcards.chromo} > {output} 2> {log};"
shell:
    " awk -v sgc={wildcards.sg_or_chrom} -v scat={wildcards.scatter} ' "
    "    NR>1 && $1 == sgc && $2==scat {{printf(\"%s:%s-%s\\n\", $3, $4, $5)}} "
    " ' {input.scatters_file} > {output} 2> {log};"
shell:
    "gatk --java-options \"{params.java_opts}\" HaplotypeCaller "
    " -R {input.ref} "
    " -I {input.bam} "
    " -O {output.gvcf} "
    " -L {input.interval_list} "
    " --native-pair-hmm-threads {threads} "
    " {params.conf_pars} "
    " -ERC GVCF > {log.stdout} 2> {log.stderr} "
shell:
    " bcftools concat {params.opts} -O z {input} > {output.gvcf} 2>{log}; "
    " bcftools index -t {output.gvcf} "
shell:
    " mkdir -p results/bqsr-round-{wildcards.bqsr_round}/genomics_db; "
    " gatk --java-options {params.java_opts} GenomicsDBImport "
    " $(echo {input.gvcfs} | awk '{{for(i=1;i<=NF;i++) printf(\" -V %s \", $i)}}') "
    " {params.my_opts} {params.db} 2>&1 | tee {log} > {log}.import_{params.import_num} && "
    " (echo $(({params.import_num} + 1)) > {output.counter}) && "
    " (for i in {params.histories}; do mkdir -p $(dirname $i); echo {params.import_num} $(date) > $i; done)"
shell:
    " mkdir -p results/bqsr-round-{wildcards.bqsr_round}/genomics_db; "
    " awk -v sg={wildcards.scaff_group} 'NR>1 && $1 == sg {{print $2}}' {input.scaff_groups} > {output.interval_list}; "
    " gatk --java-options {params.java_opts} GenomicsDBImport "
    " $(echo {input.gvcfs} | awk '{{for(i=1;i<=NF;i++) printf(\" -V %s \", $i)}}') "
    " {params.my_opts} {params.db} 2>&1 | tee {log} > {log}.import_{params.import_num} && "
    " (echo $(({params.import_num} + 1)) > {output.counter}) && "
    " (for i in {params.histories}; do mkdir -p $(dirname $i); echo {params.import_num} $(date) > $i; done) "
shell:
    " gatk --java-options {params.java_opts} GenotypeGVCFs "
    " {params.pextra} "
    " -L {input.scatters} "
    " -R {input.genome} "
    " -V gendb://{params.gendb} "
    " -O {output.vcf} > {log} 2> {log} "
shell:
    " (bcftools concat {params.opts} -Oz {input.vcf} > {output.vcf}; "
    " bcftools index -t {output.vcf})  2>{log}; "
shell:
    "(bcftools +setGT {input.vcf} -- -t q -n . -i 'FMT/DP=0 | (FMT/PL[:0]=0 & FMT/PL[:1]=0 & FMT/PL[:2]=0)' | "
    " bcftools +fill-tags - -- -t 'NMISS=N_MISSING' | "
    " bcftools view -Oz - > {output.vcf}; "
    " bcftools index -t {output.vcf}) 2> {log} "
shell:
    " (bcftools concat {params.opts} -Ob {input} > {output.bcf}; "
    " bcftools index {output.bcf})  2> {log}; "
shell:
    " (bcftools concat {params.opts} -Ob {input} > {output.bcf}; "
    " bcftools index {output.bcf})  2>{log}; "
script:
	"../scripts/sequence-scatter-bins.R"
shell:
	" awk 'NF>0 {{clen = $3}} END {{print clen}}' {input} > {output} "
shell:
    " ("
    " printf \"sample\\tave_depth\\n\";   "
	" for i in {input.ss}; do "
	" FN=$(basename $i);    "
	" FN=${{FN/.txt/}};       "
	" awk -F\"\t\" -v f=$FN -v NumBases=$(cat {input.gLen}) '  "
	"   BEGIN {{OFS=\"\t\";}} "
	"   $2==\"bases mapped (cigar):\" {{sub(/\\.stats/, \"\", f); print f,  $3/NumBases}} "
	" ' $i; done) > {output}  "
shell:
	" ( "
	" OPT=$(awk 'NR>1 && $1==\"{wildcards.sample}\" {{ wc = \"{wildcards.cov}\"; if(wc == \"FD\") {{print \"NOSAMPLE\"; exit}} fract = wc / $NF; if(fract < 1) print fract; else print \"NOSAMPLE\"; }}' {input.dps});  "
	" if [ $OPT = \"NOSAMPLE\" ]; then "
	"     ln  {input.bam} {output.bam}; "
	"     ln  {input.bai} {output.bai}; " 
	" else "
	"     samtools view --subsample $OPT --subsample-seed 1  -b {input.bam} > {output.bam}; "
	"     samtools index {output.bam}; "
	" fi "
	" ) 2> {log} "
wrapper:
    "0.59.0/bio/gatk/selectvariants"
wrapper:
    "0.59.2/bio/gatk/variantfiltration"
wrapper:
    "0.59.2/bio/gatk/variantrecalibrator"
wrapper:
    "0.59.2/bio/picard/mergevcfs"
shell:
    " gatk SelectVariants -V {input.vcf}  -select-type SNP -O {output.vcf} > {log} 2>&1 "
shell:
    " gatk SelectVariants -V {input.vcf}  -select-type INDEL -O {output.vcf} > {log} 2>&1 "
shell:
    "gatk VariantFiltration "
    " -V {input.vcf} "
    "  -filter 'QD < 2.0' --filter-name 'QD2' "
    "  -filter 'QUAL < 30.0' --filter-name 'QUAL30' "
    "  -filter 'SOR > 3.0' --filter-name 'SOR3' "
    "  -filter 'FS > 60.0' --filter-name 'FS60' "
    "  -filter 'MQ < 40.0' --filter-name 'MQ40' "
    "  -filter 'MQRankSum < -12.5' --filter-name 'MQRankSum-12.5' "
    "  -filter 'ReadPosRankSum < -8.0' --filter-name 'ReadPosRankSum-8' "
    " -O {output.vcf} > {log} 2>&1 "
shell:
    "gatk VariantFiltration "
    " -V {input.vcf} "
    "  -filter 'QD < 2.0' --filter-name 'QD2' "
    "  -filter 'QUAL < 30.0' --filter-name 'QUAL30' "
    "  -filter 'FS > 200.0' --filter-name 'FS200' "
    "  -filter 'ReadPosRankSum < -20.0' --filter-name 'ReadPosRankSum-20' "
    " -O {output.vcf} > {log} 2>&1 "
shell:
    "(bcftools concat -a {input.snp} {input.indel} | "
    " bcftools view -Ob > {output.vcf}; ) 2> {log} "
shell:
    " bcftools view -Ob -i 'FILTER=\"PASS\" & MAF > {params.maf} ' "
    " {input} > {output} 2>{log} "
wrapper:
    "v1.1.0/bio/trimmomatic/pe"
wrapper:
    "v1.23.3/bio/bwa/mem"
wrapper:
    "v1.1.0/bio/picard/markduplicates"
wrapper:
    "v1.1.0/bio/fastqc"
wrapper:
    "v1.1.0/bio/fastqc"
shell:
    "samtools stats {input} > {output} 2> {log} "
wrapper:
    "v1.23.3/bio/multiqc"
shell:
    "bcftools view {params.filter_opt} -Ou {input} | "
    " bcftools stats -s {params.comma_samples} "
    " -u NMISS:0:{params.stop}:{params.steps} > "
    " {output} 2> {log} "
shell:
    " bcftools stats -s {params.comma_samples} "
    " -u NMISS:0:{params.stop}:{params.steps}  {input} > "
    " {output} 2> {log} "
shell:
    " plot-vcfstats -m {input} > {output} 2>{log} "
shell:
    " plot-vcfstats -m {input} > {output} 2>{log} "
shell:
    " (tmp_dir=$(mktemp -d) && "
    " URL={params.url} && "
    " if [[ $URL =~ \.gz$ ]]; then EXT='.gz'; else EXT=''; fi && "
    " wget -O $tmp_dir/file$EXT $URL && "
    " if [[ $URL =~ \.gz$ ]]; then gunzip $tmp_dir/file$EXT; fi && "
    " mv $tmp_dir/file {output}) > {log} 2>&1 "
Snakemake: from line 12 of rules/ref.smk
shell:
    "samtools faidx {input}"
shell:
    "samtools dict {input} > {output} 2> {log} "
wrapper:
    "0.59.2/bio/bwa/index"
Snakemake: from line 63 of rules/ref.smk
shell:
    "(bcftools view --apply-filters PASS --output-type u {input} | "
    "rbt vcf-to-txt -g --fmt DP AD --info ANN | "
    "gzip > {output}) 2> {log}"
script:
    "../scripts/plot-depths.py"
import sys
sys.stderr = open(snakemake.log[0], "w")

import common
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

calls = pd.read_table(snakemake.input[0], header=[0, 1])
samples = [name for name in calls.columns.levels[0] if name != "VARIANT"]
sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False)
sample_info = sample_info.rename({"level_1": "sample"}, axis=1)

sample_info = sample_info[sample_info["DP"] > 0]
sample_info["freq"] = sample_info["AD"] / sample_info["DP"]
sample_info.index = np.arange(sample_info.shape[0])

plt.figure()

sns.stripplot(x="sample", y="freq", data=sample_info, jitter=True)
plt.ylabel("allele frequency")
plt.xticks(rotation="vertical")

plt.savefig(snakemake.output.freqs)

plt.figure()

sns.stripplot(x="sample", y="DP", data=sample_info, jitter=True)
plt.ylabel("read depth")
plt.xticks(rotation="vertical")

plt.savefig(snakemake.output.depths)
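
# qd_and_qual.sh: tally integer-binned counts of QUAL and INFO/QD values from a BCF
# and write each tally, sorted in decreasing order of the binned value, to a file.
# Invoked as: qd_and_qual.sh <input.bcf> <qd_output> <qual_output> <logfile>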
INP=$1
QD=$2
QUAL=$3
LOG=$4

(bcftools query -f '%QUAL\t%INFO/QD\n' $INP | 
awk -v qd_file=${QD}-unsrt -v qual_file=${QUAL}-unsrt  '
	BEGIN {OFS="\t"}

	{
		qual[int($1)]++; 
		qd[int($2)]++;
	}

	END {
		for(i in qual) print i, qual[i] > qual_file;
    	for(i in qd) print i, qd[i] > qd_file;
	}
') 2> $LOG;

(sort -grk1 ${QUAL}-unsrt > ${QUAL} && rm ${QUAL}-unsrt) 2>>$LOG;
(sort -grk1 ${QD}-unsrt > ${QD} && rm ${QD}-unsrt) 2>>$LOG
log <- file(snakemake@log[[1]], open="wt")
sink(log, type = "output")
sink(log, type = "message")

library(tidyverse)


# these are while developing
#chroms_file = ".test/config/chromosomes.tsv"
#scaffs_file = ".test/config/scaffold_groups.tsv"
#max_chunk <- 900000  # shooting for just < than 1 Mb chunks


chroms_file <- snakemake@input$chroms
scaffs_file <- snakemake@input$scaffs
max_chunk <- as.integer(snakemake@params$binsize)
outfile <- snakemake@output$interval_list


# read the files in and prepare them so they have
# columns that are comparable
chroms <- read_tsv(chroms_file) %>%
  mutate(id = chrom, .before = chrom) %>%
  group_by(id) %>%
  mutate(cumul = cumsum(num_bases)) %>%
  ungroup()

scaffs <- read_tsv(scaffs_file) %>%
  rename(num_bases = len)


# Now, we handle the chromosomes and the scaffold groups a little
# differently.  (It's a shame really, they should just all be scaff_groups,
# with the chromosomes just being scaff groups of size 1.  Sigh...)

# for the chromos, we just find a good length to chop them up into.  
# Here is a function that tries to get them all close to an equal length
# that is less than max_chunk:
#' @param L length of the chromosome
#' @param m max length of chunk
#' @param S the starting value
#' @return returns a tibble with start and stop columns of the different
#' chunks. (base-1 indexed, inclusive).
chromo_chopper = function(L, m, S = 1) {
  # The number of bins will be:
  B = ceiling(L / m)

  # if m divides L perfectly, mstar is m and last_one = mstar
  if((L %% m) == 0) {
    mstar = m
    last_one = mstar
  } else {  # otherwise
    mstar = ceiling(L / B)
    last_one = L - mstar * (B - 1)
  }

  starts <- seq(S, S + B * (mstar - 1), by = mstar)
  ends <- starts + mstar - 1
  ends[length(ends)] <- starts[length(ends)] + last_one - 1

  tibble(
    start = starts,
    end = ends
  ) %>%
    mutate(scatter_length = end - start + 1) %>%
    mutate(scatter_idx = sprintf("scat_%04d", 1:n()), .before = start)
}


chroms2 <- chroms %>%
  group_by(id) %>%
  mutate(scats = map(num_bases, chromo_chopper, m = max_chunk)) %>%
  unnest(scats) %>%
  select(id, scatter_idx, chrom, start, end, scatter_length)

# now, for the scaffold groups we will do it a little differently.
# Any scaffold that is greater than the max_chunk will get broken down into
# roughly equal-sized chunks that are all less than the max_chunk.
scaff_chopper <- function(L, m) {
  if(L <= m) return(
    tibble(
      start = 1,
      end = L,
      seg_length = L
    )
  )
  else {
    return(
      chromo_chopper(L, m) %>%
        rename(seg_length = scatter_length) %>%
        select(-scatter_idx)
    )
  }
}

scaffs2 <- scaffs %>%
  ungroup() %>%
  mutate(chops = map(num_bases, scaff_chopper, m = max_chunk)) %>%
  unnest(chops) %>%
  group_by(id) %>%
  mutate(cumul = cumsum(seg_length))


# now we will group by id and within each one, just iterate through
# the number of bases and assign to different scat groups.
#' @param n the number of bases in each segment (i.e. the seg_length)
#' @param m the desired max chunk size
iteratively_assign_scats <- function(n, m) {
  c <- 0  # cumulative in scat group
  scg <- 1
  ret <- character(length(n))
  for(i in 1:length(n)) {
    c <- c + n[i]
    if(c > m) {
      scg <- scg + 1
      c <- n[i]
    }
    ret[i] <- sprintf("scat_%04d", scg)
  }
  ret
}

scaffs3 <- scaffs2 %>%
  group_by(id) %>%
  mutate(scatter_idx = iteratively_assign_scats(n = seg_length, m = max_chunk)) %>%
  group_by(id, scatter_idx) %>%
  mutate(scatter_length = sum(seg_length)) %>%
  ungroup() %>%
  select(id, scatter_idx, chrom, start, end, scatter_length)

chrom_and_sg <- bind_rows(chroms2, scaffs3)

write_tsv(chrom_and_sg, file = snakemake@output$tsv)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
java_opts = snakemake.params.get("java_opts", "")

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
    "gatk --java-options '{java_opts}' SelectVariants -R {snakemake.input.ref} -V {snakemake.input.vcf} "
    "{extra} -O {snakemake.output.vcf} {log}"
)
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2016, Patrik Smeds"
__email__ = "[email protected]"
__license__ = "MIT"

from os import path

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

# Check inputs/arguments.
if len(snakemake.input) == 0:
    raise ValueError("A reference genome has to be provided!")
elif len(snakemake.input) > 1:
    raise ValueError("Only one reference genome can be inputed!")

# Prefix that should be used for the database
prefix = snakemake.params.get("prefix", "")

if len(prefix) > 0:
    prefix = "-p " + prefix

# Construction algorithm that will be used to build the database, default is bwtsw
construction_algorithm = snakemake.params.get("algorithm", "")

if len(construction_algorithm) != 0:
    construction_algorithm = "-a " + construction_algorithm

shell(
    "bwa index" " {prefix}" " {construction_algorithm}" " {snakemake.input[0]}" " {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
java_opts = snakemake.params.get("java_opts", "")
filters = [
    "--filter-name {} --filter-expression '{}'".format(name, expr.replace("'", "\\'"))
    for name, expr in snakemake.params.filters.items()
]

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
    "gatk --java-options '{java_opts}' VariantFiltration -R {snakemake.input.ref} -V {snakemake.input.vcf} "
    "{extra} {filters} -O {snakemake.output.vcf} {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
java_opts = snakemake.params.get("java_opts", "")


def fmt_res(resname, resparams):
    fmt_bool = lambda b: str(b).lower()
    try:
        f = snakemake.input.get(resname)
    except KeyError:
        raise RuntimeError(
            "There must be a named input file for every resource (missing: {})".format(
                resname
            )
        )
    return "{},known={},training={},truth={},prior={}:{}".format(
        resname,
        fmt_bool(resparams["known"]),
        fmt_bool(resparams["training"]),
        fmt_bool(resparams["truth"]),
        resparams["prior"],
        f,
    )


resources = [
    "--resource {}".format(fmt_res(resname, resparams))
    for resname, resparams in snakemake.params["resources"].items()
]
annotation = list(map("-an {}".format, snakemake.params.annotation))
tranches = ""
if snakemake.output.tranches:
    tranches = "--tranches-file " + snakemake.output.tranches

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
    "gatk --java-options '{java_opts}' VariantRecalibrator {extra} {resources} "
    "-R {snakemake.input.ref} -V {snakemake.input.vcf} "
    "-mode {snakemake.params.mode} "
    "--output {snakemake.output.vcf} "
    "{tranches} {annotation} {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2018, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


inputs = " ".join("INPUT={}".format(f) for f in snakemake.input.vcfs)
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra = snakemake.params.get("extra", "")

shell(
    "picard"
    " MergeVcfs"
    " {extra}"
    " {inputs}"
    " OUTPUT={snakemake.output[0]}"
    " {log}"
)
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2020, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"

from pathlib import Path
from snakemake.shell import shell


def get_only_child_dir(path):
    children = [child for child in path.iterdir() if child.is_dir()]
    assert (
        len(children) == 1
    ), "Invalid VEP cache directory, only a single entry is allowed, make sure that cache was created with the snakemake VEP cache wrapper"
    return children[0]


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

fork = "--fork {}".format(snakemake.threads) if snakemake.threads > 1 else ""
stats = snakemake.output.stats
cache = snakemake.input.cache
plugins = snakemake.input.plugins

entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = entrypath.parent.name
release, build = entrypath.name.split("_")

load_plugins = " ".join(map("--plugin {}".format, snakemake.params.plugins))

if snakemake.output.calls.endswith(".vcf.gz"):
    fmt = "z"
elif snakemake.output.calls.endswith(".bcf"):
    fmt = "b"
else:
    fmt = "v"

shell(
    "(bcftools view {snakemake.input.calls} | "
    "vep {extra} {fork} "
    "--format vcf "
    "--vcf "
    "--cache "
    "--cache_version {release} "
    "--species {species} "
    "--assembly {build} "
    "--dir_cache {cache} "
    "--dir_plugins {plugins} "
    "--offline "
    "{load_plugins} "
    "--output_file STDOUT "
    "--stats_file {stats} | "
    "bcftools view -O{fmt} > {snakemake.output.calls}) {log}"
)
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc {snakemake.params} -t {snakemake.threads} "
        "--outdir {tempdir:q} {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

log = snakemake.log_fmt_shell()

extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)

bams = snakemake.input
if isinstance(bams, str):
    bams = [bams]
bams = list(map("--INPUT {}".format, bams))

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "picard MarkDuplicates"  # Tool and its subcommand
        " {java_opts}"  # Automatic java option
        " {extra}"  # User defined parmeters
        " {bams}"  # Input bam(s)
        " --TMP_DIR {tmpdir}"
        " --OUTPUT {snakemake.output.bam}"  # Output bam
        " --METRICS_FILE {snakemake.output.metrics}"  # Output metrics
        " {log}"  # Logging
    )
__author__ = "Johannes Köster, Jorge Langa"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

# Distribute available threads between trimmomatic itself and any potential pigz instances
def distribute_threads(input_files, output_files, available_threads):
    gzipped_input_files = sum(1 for file in input_files if file.endswith(".gz"))
    gzipped_output_files = sum(1 for file in output_files if file.endswith(".gz"))
    potential_threads_per_process = available_threads // (
        1 + gzipped_input_files + gzipped_output_files
    )
    if potential_threads_per_process > 0:
        # decompressing pigz creates at most 4 threads
        pigz_input_threads = (
            min(4, potential_threads_per_process) if gzipped_input_files != 0 else 0
        )
        pigz_output_threads = (
            (available_threads - pigz_input_threads * gzipped_input_files)
            // (1 + gzipped_output_files)
            if gzipped_output_files != 0
            else 0
        )
        trimmomatic_threads = (
            available_threads
            - pigz_input_threads * gzipped_input_files
            - pigz_output_threads * gzipped_output_files
        )
    else:
        # not enough threads for pigz
        pigz_input_threads = 0
        pigz_output_threads = 0
        trimmomatic_threads = available_threads
    return trimmomatic_threads, pigz_input_threads, pigz_output_threads


def compose_input_gz(filename, threads):
    if filename.endswith(".gz") and threads > 0:
        return "<(pigz -p {threads} --decompress --stdout {filename})".format(
            threads=threads, filename=filename
        )
    return filename


def compose_output_gz(filename, threads, compression_level):
    if filename.endswith(".gz") and threads > 0:
        return ">(pigz -p {threads} {compression_level} > {filename})".format(
            threads=threads, compression_level=compression_level, filename=filename
        )
    return filename


extra = snakemake.params.get("extra", "")
java_opts = get_java_opts(snakemake)
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
compression_level = snakemake.params.get("compression_level", "-5")
trimmer = " ".join(snakemake.params.trimmer)

# Distribute threads
input_files = [snakemake.input.r1, snakemake.input.r2]
output_files = [
    snakemake.output.r1,
    snakemake.output.r1_unpaired,
    snakemake.output.r2,
    snakemake.output.r2_unpaired,
]

trimmomatic_threads, input_threads, output_threads = distribute_threads(
    input_files, output_files, snakemake.threads
)

input_r1, input_r2 = [
    compose_input_gz(filename, input_threads) for filename in input_files
]

output_r1, output_r1_unp, output_r2, output_r2_unp = [
    compose_output_gz(filename, output_threads, compression_level)
    for filename in output_files
]

shell(
    "trimmomatic PE -threads {trimmomatic_threads} {java_opts} {extra} "
    "{input_r1} {input_r2} "
    "{output_r1} {output_r1_unp} "
    "{output_r2} {output_r2_unp} "
    "{trimmer} "
    "{log}"
)
__author__ = "Johannes Köster, Julian de Ruiter"
__copyright__ = "Copyright 2016, Johannes Köster and Julian de Ruiter"
__email__ = "[email protected], [email protected]"
__license__ = "MIT"


import tempfile
from os import path
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
from snakemake_wrapper_utils.samtools import get_samtools_opts


# Extract arguments.
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
sort = snakemake.params.get("sorting", "none")
sort_order = snakemake.params.get("sort_order", "coordinate")
sort_extra = snakemake.params.get("sort_extra", "")
samtools_opts = get_samtools_opts(snakemake)
java_opts = get_java_opts(snakemake)


index = snakemake.input.idx
if isinstance(index, str):
    index = path.splitext(snakemake.input.idx)[0]
else:
    index = path.splitext(snakemake.input.idx[0])[0]


# Check inputs/arguments.
if not isinstance(snakemake.input.reads, str) and len(snakemake.input.reads) not in {
    1,
    2,
}:
    raise ValueError("input must have 1 (single-end) or 2 (paired-end) elements")


if sort_order not in {"coordinate", "queryname"}:
    raise ValueError("Unexpected value for sort_order ({})".format(sort_order))


# Determine which pipe command to use for converting to bam or sorting.
if sort == "none":
    # Simply convert to bam using samtools view.
    pipe_cmd = "samtools view {samtools_opts}"

elif sort == "samtools":
    # Add name flag if needed.
    if sort_order == "queryname":
        sort_extra += " -n"

    # Sort alignments using samtools sort.
    pipe_cmd = "samtools sort {samtools_opts} {sort_extra} -T {tmpdir}"

elif sort == "picard":
    # Sort alignments using picard SortSam.
    pipe_cmd = "picard SortSam {java_opts} {sort_extra} --INPUT /dev/stdin --TMP_DIR {tmpdir} --SORT_ORDER {sort_order} --OUTPUT {snakemake.output[0]}"

else:
    raise ValueError(f"Unexpected value for params.sort ({sort})")

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "(bwa mem"
        " -t {snakemake.threads}"
        " {extra}"
        " {index}"
        " {snakemake.input.reads}"
        " | " + pipe_cmd + ") {log}"
    )
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/eriqande/mega-non-model-wgs-snakeflow
Name: mega-non-model-wgs-snakeflow
Version: 1
Copyright: Public Domain
License: None