Variant Calling Workflow for Joint Multi-Sample VCF Generation


Forked from https://gitlab.com/data-analysis5/dna-sequencing/54gene-wgs-germline

This workflow takes either FASTQs or gVCFs as input and emits a joint-called, multi-sample VCF. Please see Read the Docs for additional documentation.
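
As a rough sketch of how a run might be launched (the exact targets, resources, and configuration keys are deployment-specific and documented on Read the Docs; the config path below is an assumption):

    # hypothetical invocation; adjust cores and config path to your environment
    snakemake --use-conda --cores 24 --configfile config/config.yaml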


Code Snippets

Align reads with bwa mem, adding read group information, and coordinate-sort the output with samtools:
shell:
    "mkdir -p {output.t} && "
    "bwa mem "
    "-K 10000000 -M "
    '-R "@RG\\tCN:{params.center}\\tID:{params.rg}\\tSM:{params.sm}\\tPL:{params.pl}\\tLB:N/A" '
    "-t {threads} "
    "{input.r} {input.r1} {input.r2} | "
    "samtools sort -@ {params.samtools_threads} -T {output.t} -m {params.samtools_mem}M -o {output.bam} - "
Mark and remove duplicate reads with GATK MarkDuplicates:
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" MarkDuplicates '
    "TMP_DIR={params.t} "
    "REMOVE_DUPLICATES=true "
    "INPUT={params.l} "
    "OUTPUT={output.bam} "
    "METRICS_FILE={output.metrics}"
Build a base quality score recalibration table with GATK BaseRecalibrator, using known SNP and indel sites:
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" BaseRecalibrator '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "--known-sites {input.snps} "
    "--known-sites {input.indels} "
    "-O {output.table}"
Apply the recalibration table to the BAM with GATK ApplyBQSR:
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" ApplyBQSR '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "--bqsr-recal-file {input.recal} "
    "-O {output.bam}"
Collect alignment statistics with samtools stats:
shell:
    "samtools stats {input.bam} > {output}"
Symlink the input FASTQs into the working directory:
shell:
    "ln -s {input.r1} {output.r1} && "
    "ln -s {input.r2} {output.r2}"
Run FastQC on both reads of each pair:
shell:
    "fastqc {input.r1} -d {params.t} --quiet -t {threads} --outdir={params.o} && "
    "fastqc {input.r2} -d {params.t} --quiet -t {threads} --outdir={params.o}"
Trim adapters and drop reads shorter than 36 bp with fastp:
shell:
    "fastp -i {input.r1} -I {input.r2} "
    "-w {threads} "
    "-o {output.r1_paired} -O {output.r2_paired} "
    "-h {output.h} -j {output.j} "
    "--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA "
    "--adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT "
    "-l 36"
Split multiallelic records and normalize against the reference with bcftools norm, then index the output:
shell:
    "bcftools norm "
    "-f {input.ref} "
    "-m - "
    "--threads {threads} "
    "{input.vcf} "
    "-Oz -o {output.vcf} && "
    "tabix -p vcf {output.vcf}"
Select SNPs with GATK SelectVariants:
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--select-type SNP "
    "-O {output.vcf}"
Select indels with GATK SelectVariants:
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--select-type INDEL "
    "-O {output.vcf}"
Apply hard filters to SNPs with GATK VariantFiltration:
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" VariantFiltration '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    '-filter "QD < 2.0" --filter-name "QD2" '
    '-filter "QUAL < 30.0" --filter-name "QUAL30" '
    '-filter "SOR > 3.0" --filter-name "SOR3" '
    '-filter "FS > 60.0" --filter-name "FS60" '
    '-filter "MQ < 40.0" --filter-name "MQ40" '
    '-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" '
    '-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" '
    "-O {output.vcf}"
Apply hard filters to indels with GATK VariantFiltration:
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" VariantFiltration '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    '-filter "QD < 2.0" --filter-name "QD2" '
    '-filter "QUAL < 30.0" --filter-name "QUAL30" '
    '-filter "FS > 200.0" --filter-name "FS200" '
    '-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" '
    "-O {output.vcf}"
Subset the VCF to a target region with bcftools view and index the result:
shell:
    "bcftools view --threads {threads} -r {params.region} -Oz -o {output.vcf} {input.vcf} && "
    "tabix -p vcf {output.vcf}"
Estimate sample contamination with verifyBamID:
shell:
    "verifyBamID "
    "--best "
    "--vcf {input.vcf} "
    "--bam {input.bam} "
    "--out {params.d}"
From line 205 of rules/filter.smk
Concatenate the verifyBamID *.selfSM outputs, dropping header lines:
shell:
    'grep -v "^#" {params}*selfSM  > {output}'
From line 220 of rules/filter.smk
Concatenate the filtered SNP and indel VCFs, then sort and index:
shell:
    "bcftools concat -a -Ov {input.snps} {input.indels} | "
    "bcftools sort -T {params.t} -Oz -o {output.vcf} && "
    "tabix -p vcf {output.vcf}"
Collect variant calling metrics against dbSNP with GATK CollectVariantCallingMetrics:
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" CollectVariantCallingMetrics '
    "--TMP_DIR {params.t} "
    "-I {input.calls} "
    "--DBSNP {input.dbsnp} "
    "-SD {input.d} "
    "-O {params.d}"
Drop filtered records with GATK SelectVariants --exclude-filtered:
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--exclude-filtered "
    "-O {output.vcf}"
Call variants per sample and interval with GATK HaplotypeCaller in GVCF mode:
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" HaplotypeCaller '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "-ERC GVCF "
    "-L {input.intervalfile} "
    "-O {output.gvcf} "
    "-G StandardAnnotation "
    "-G StandardHCAnnotation"
Compress the gVCF with bgzip and index it with tabix:
shell:
    "bgzip -c {input.gvcf} > {output.gvcf} && "
    "tabix -p vcf {input.gvcf}.gz"
Gather the per-interval gVCFs with GATK GatherVcfs:
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GatherVcfs '
    "-I {params.l} "
    "-O {output}"
Index the gathered gVCF with tabix:
shell:
    "tabix -p vcf {input}"
Symlink input gVCFs and their indices into the working directory:
shell:
    "ln -s {input.gvcf} {output.gvcf} && "
    "ln -s {input.index} {output.index}"
Write a sample-map entry (sample name and gVCF path) for GenomicsDBImport:
shell:
    "n=$(bcftools query -l {input.gvcf});"
    'echo "${{n}}\t{input.gvcf}" > {output}'
Concatenate the per-sample map files into a single sample map:
shell:
    "for i in {params.mapDir}*.map; do cat $i >> {output}; done"
Import gVCFs into a per-interval GenomicsDB workspace with GATK GenomicsDBImport:
shell:
    'export _JAVA_OPTIONS="" && '
    "rm -r {params.db} && "
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GenomicsDBImport '
    "--batch-size {params.batch_size} "
    "--disable-bam-index-caching "
    "--sample-name-map {input.sampleMap} "
    "--genomicsdb-workspace-path {params.db} "
    "-L {input.intervalfile} "
    "--tmp-dir {params.t} "
    "--reader-threads {params.reader_threads} "
    "--genomicsdb-shared-posixfs-optimizations"
Joint-genotype the GenomicsDB workspace with GATK GenotypeGVCFs:
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GenotypeGVCFs '
    "-R {input.r} "
    "-V gendb://{params.db} "
    "-O {output.vcf} "
    "--tmp-dir {params.t} "
    "--max-alternate-alleles {params.m} "
    "--genomicsdb-max-alternate-alleles {params.g} "
    "--max-genotype-count {params.c} "
    "-stand-call-conf 30 "
    "-G StandardAnnotation "
    "-G StandardHCAnnotation"
Concatenate and sort the per-interval VCFs into the project-level VCF, then index:
shell:
    "bcftools concat -a {input.vcfList} -Ou | "
    "bcftools sort -T {params.t} -Oz -o {output.projectVCF} && "
    "tabix -p vcf {output.projectVCF}"
Summarize the joint-called VCF with bcftools stats:
shell:
    "bcftools stats "
    "--af-bins 0.01,0.05,0.1,1 "
    "-F {input.r} "
    "-s- {input.vcf} > {output}"
Plot the bcftools stats output with plot-vcfstats:
shell:
    "plot-vcfstats -P -p {params.d} {input}"
Generate a plink-style ped file from the sample linker with generate_ped.py:
shell:
    "python workflow/scripts/generate_ped.py {input} {params.prefix}"
Extract genotypes at known sites and estimate relatedness with somalier:
shell:
    "somalier extract "
    "-d {params.d} "
    "--sites $CONDA_PREFIX/share/somalier/sites.hg38.vcf.gz "
    "-f {input.r} {input.vcf} && "
    "somalier relate --ped {input.ped} -o {params.o} {params.d}/*.somalier"
Create a placeholder output file:
shell:
    "touch {output}"
Infer sample sex with bcftools +guess-ploidy:
shell:
    "bcftools +guess-ploidy -g hg38 {input.vcf} -v > {output.txt} && "
    "guess-ploidy.py {output.txt} {params.p}"
Build the sample exclusion list from bcftools stats and verifyBamID output:
shell:
    "python workflow/scripts/create_exclude_list.py {input.b} {params.out} --verify {input.v} -r {params.r} -d {params.d} -c {params.c}"
Build the sample exclusion list from bcftools stats only:
shell:
    "python workflow/scripts/create_exclude_list.py {input.b} {params.out} -r {params.r} -d {params.d}"
Remove excluded samples with bcftools view, set variant IDs with bcftools annotate, and index:
shell:
    "bcftools view -S ^{input.l} --threads {threads} -Ou {input.v} | "
    "bcftools annotate --threads {threads} --set-id '%CHROM:%POS:%REF:%ALT' -Oz -o {output.v} && "
    "tabix -p vcf {output.v}"
Aggregate QC results with MultiQC:
shell:
    "multiqc --force -o {params.outDir} -n {params.outName} --config {input.mqc_config} {params.inDirs} {params.relatedness}"
shell:
    "multiqc --force -o {params.outDir} --config {input.mqc_config} -n {params.outName} {params.inDirs} {params.relatedness}"
shell:
    "multiqc --force -o {params.outDir} --config {input.mqc_config} -n {params.outName} {params.inDirs}"
List the samples in the final VCF with bcftools query:
shell:
    "bcftools query -l {input} > {output}"
Render the run summary report:
script:
    "../scripts/run_summary.Rmd"
script:
    "../scripts/run_summary.Rmd"
Render the initial run summary report:
script:
    "../scripts/run_initial_summary.Rmd"
From line 103 of rules/report.smk
Combine per-rule benchmark files:
script:
    "../scripts/combine_benchmarks.R"
From line 131 of rules/report.smk
Render the benchmarking report:
script:
    "../scripts/benchmarking_report.Rmd"
From line 151 of rules/report.smk
Download the hg38 reference bundle (FASTA, BWA index files, known-sites VCFs) from the Broad public S3 bucket:
shell:
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx resources/ --no-sign-request "
Record the pipeline start time as a Unix timestamp:
shell:
    "date +%s > {output}"
Benchmarking report setup: load packages, read the combined benchmarks table, and derive units for plotting:
# Load required packages
knitr::opts_chunk$set(echo = TRUE, fig.width=12, fig.height=8)
require(knitr, quietly = TRUE)
require(ggplot2, quietly=TRUE)
require(RColorBrewer, quietly=TRUE)
require(dplyr, quietly = TRUE)
require(gluedown, quietly = TRUE)
require(gridExtra, quietly = TRUE)
require(lubridate, quietly = TRUE)

# set input variable from snakemake
benchmarks.table <- snakemake@input[["benchmarks"]]

# read in tsv
dataset <- read.table(benchmarks.table, header=TRUE, sep="\t")

# some light date/time manipulation for ease in plotting
dataset$times <- hms(dataset$h.m.s)
d.lub <- hour(dataset$times) + minute(dataset$times)/60 + second(dataset$times)/3600
dataset$walltime <- d.lub

# some upfront unit conversions  & derivations for ease in comprehension
# # convert IO units from MB to GB
dataset$io_in <- dataset$io_in / 1000
dataset$io_out <- dataset$io_out / 1000

# # convert CPU time from seconds to hours
dataset$cpu_time_hrs <- dataset$cpu_time / 3600

# # calculate cpu/walltime ratio
dataset$cpuwall_ratio <- (dataset$cpu_time / dataset$s)

# subset the dataset to get data for rules where there is more than 1
# observation/process to pass to geom_boxplot
subset.df <- dataset %>% group_by(rule) %>% filter(n() > 1)

# subset the dataset to get data points for rules meeting time_threshold
# parameter specified at the rule level
time.param <- snakemake@params[["threshold"]]
time.threshold <- time.param * 60
filtered.df <- dataset %>% group_by(rule) %>% filter(s >= time.threshold)

# start time to calculate elapsed time
tat.file <- snakemake@input[["start_time"]]
# plotting function for box/jitter plot where single data points are omitted
# from boxplots
create.boxjitter.plot <- function(df, df.subset, xvar, yvar, xlab, ylab) {
  bd.plot <- ggplot(data=df, aes_string(x=xvar, y=yvar)) +
    geom_boxplot(data = df.subset, aes_string(fill=xvar), show.legend = FALSE)  +
    geom_point(position="identity", show.legend = FALSE, alpha=1/2, size=1) +
    labs(x=xlab, y=ylab) +
    theme_minimal() +
    theme(axis.text.x=element_text(angle = 90, vjust=0.5, hjust=1),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12))
  bd.plot
}

create.subset.plot <- function(subdf, xvar, yvar, xlab, ylab, title){
  sb.plot <- ggplot(data = subdf, aes_string(x = xvar, y = yvar)) +
    geom_boxplot(aes_string(fill=xvar), show.legend = FALSE) +
    geom_point(position = "identity", show.legend = FALSE, alpha=1/2, size=1) +
    labs(x=xlab, y=ylab, title=title) +
    theme_minimal() +
    theme(axis.text.x=element_text(angle = 90, vjust=0.5, hjust=1),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12))
  sb.plot
}
# list the rules included in this report as a bulleted markdown list
inlines <- c(
  md_bold(unique(dataset$rule))
)
md_bullet(inlines)
# compute total elapsed time from the recorded start time
start.time <- read.table(tat.file, header=FALSE)[1,1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)
# plot walltime for all rules benchmarked
exec.time.all <- create.boxjitter.plot(dataset, subset.df, "rule", "walltime", "Rule", "Total walltime in hours for all rules")
exec.time.all
# plot maximum PSS memory usage for all rules
max.pss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_pss",
                                     "Rule", "PSS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.pss.all
# plot maximum USS memory usage for all rules
max.uss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_uss",
                                     "Rule", "USS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.uss.all
# plot maximum RSS memory usage for all rules
max.rss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_rss",
                                     "Rule", "RSS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.rss.all
# plot maximum VMS memory usage for all rules
max.vms.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_vms",
                                     "Rule", "VMS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.vms.all
# report the time threshold used to subset rules for the detailed plots
cat(paste0("Time threshold (minutes): ", time.param))
# plot the memory usage with boxplots for rules with >1 process otherwise
# just plot the datapoint
max.pss.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_pss", "Rule", "PSS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.pss.detail
# detailed USS plot for rules meeting the time threshold
max.uss.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_uss", "Rule", "USS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.uss.detail
# detailed RSS plot
max.rss.detail <- create.boxjitter.plot(dataset, subset.df, "rule", "max_rss", "Rule", "RSS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.rss.detail
# detailed VMS plot for rules meeting the time threshold
max.vms.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_vms", "Rule", "VMS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.vms.detail
# plot detailed grid view
mem.df1 <- subset(filtered.df, max_pss > 5000)
mem.df2 <- subset(filtered.df, max_pss < 5000)

max.pss.subset1 <- create.subset.plot(mem.df1, "rule", "max_pss", "Rule", "max_pss (MB)", "Rules using memory >5GB")
max.pss.subset2 <- create.subset.plot(mem.df2, "rule", "max_pss", "Rule", "max_pss (MB)", "Rules using memory <5GB")
grid.arrange(max.pss.subset1, max.pss.subset2, nrow=1)
# plot the io read
io.read <- create.boxjitter.plot(filtered.df, subset.df, "rule", "io_in", "Rule", "Cumulative GB Read")
io.read
# plot io write
io.write <- create.boxjitter.plot(filtered.df, subset.df, "rule", "io_out", "Rule", "Cumulative GB Written")
io.write
# plot detailed grid view
ioin.df1 <- subset(filtered.df, io_in > 10)
ioin.df2 <- subset(filtered.df, io_in < 10)

ioout.df1 <- subset(filtered.df, io_out > 10)
ioout.df2 <- subset(filtered.df, io_out < 10)

ioin.subset1 <- create.subset.plot(ioin.df1, "rule", "io_in", "Rule", "Cumulative GB Read", "Rules with >10GB Read")
ioout.subset1 <- create.subset.plot(ioout.df1, "rule", "io_out", "Rule", "Cumulative GB Written", "Rules with >10GB Written")

grid.arrange(ioin.subset1, ioout.subset1, nrow=2)
# plot the CPU time in hours
cpu.time <- create.boxjitter.plot(filtered.df, subset.df, "rule", "cpu_time_hrs", "Rule", "CPU time in hours")
cpu.time
# plot cpu to walltime ratio (i.e. CPU/s)
cpuwall.ratio <- create.boxjitter.plot(filtered.df, subset.df, "rule", "cpuwall_ratio", "Rule", "CPU/second")
cpuwall.ratio
sessionInfo()
combine_benchmarks.R: read each per-rule benchmark tsv, tag it with rule and process names, and write a combined table:
library(tidyverse)

# explicitly specify column types to avoid errors when coercing rows
col.types <- cols(
  s = col_double(),
  `h:m:s` = col_time(format = ""),
  max_rss = col_double(),
  max_vms = col_double(),
  max_uss = col_double(),
  max_pss = col_double(),
  io_in = col_double(),
  io_out = col_double(),
  mean_load = col_double(),
  cpu_time = col_double()
)
# function to read tsv input and append rule name and process
read_benchmarks <- function(filename) {
  read_tsv(filename, col_types=col.types) %>%
    mutate(
      process = gsub("*.tsv","",basename(filename)),
      rule =  basename(dirname(filename))
      )
}

# run function for all specified snakemake inputs and output benchmarks tsv
dataset <- snakemake@input[["tsv"]] %>%
  lapply(read_benchmarks) %>%
  bind_rows()
head(dataset)
write_tsv(dataset, snakemake@output[['benchmarks']])
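
The rule and process labels are derived from each benchmark file's path (rule = parent directory name; process derived from the file name), so the expected layout is presumably one tsv per process inside a per-rule directory, e.g. (paths hypothetical):

    benchmarks/HaplotypeCaller/sampleA.tsv   # rule = "HaplotypeCaller", process derived from "sampleA.tsv"
    benchmarks/fastqc/sampleB.tsv            # rule = "fastqc", process derived from "sampleB.tsv"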
create_exclude_list.py: flag samples for exclusion based on het/hom ratio, average depth, and contamination:
import argparse
from typing import Dict, List, Optional, Set, Tuple

import pandas as pd


def get_args():
    """Handle command line arguments"""
    parser = argparse.ArgumentParser(
        description="Generates a list of samples to exclude based on previously-run QC metrics."
    )
    parser.add_argument("bstats", type=str, help="bcftools stats file name")
    parser.add_argument("outfile", type=str, help="Output file prefix")
    parser.add_argument(
        "-v",
        "--verify",
        type=str,
        help="concatenated verifyBamID *.selfSM output files",
        default=False,
    )
    parser.add_argument(
        "-r", "--ratio", type=float, help="maximum allowed het/hom_alt ratio", default=2.5
    )
    parser.add_argument(
        "-d", "--depth", type=float, help="minimum allowed average depth", default=20.0
    )
    parser.add_argument(
        "-c", "--contam", type=float, help="maximum allowed contamination", default=0.03
    )
    results = parser.parse_args()
    return (
        results.bstats,
        results.outfile,
        results.verify,
        results.ratio,
        results.depth,
        results.contam,
    )


def read_in_bcftools_PSC(bstats: str) -> pd.DataFrame:
    """Read into a dataframe just the PSC portion of bcftools stats output"""
    skip: list[int] = [i for i, line in enumerate(open(bstats)) if not line.startswith("PSC")]
    return pd.read_csv(
        bstats,
        sep="\t",
        skiprows=skip,
        header=None,
        names=[
            "PSC",
            "id",
            "sample",
            "nRefHom",
            "nNonRefHom",
            "nHets",
            "nTransitions",
            "nTransversions",
            "nIndels",
            "average depth",
            "nSingletons",
            "nHapRef",
            "nHapAlt",
            "nMissing",
        ],
    )


def exclude_high_het_hom(df: pd.DataFrame, r: float) -> pd.DataFrame:
    """Exclude any samples with a het/hom ratio over 2.5"""
    df.loc[:, "het_hom_ratio"] = df["nHets"] / df["nNonRefHom"]
    df_het = df.loc[(df["het_hom_ratio"] > r)].copy()
    if df_het.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_het.loc[:, "exclude_reason"] = "high_het_hom"
        return df_het[["sample", "exclude_reason"]]


def exclude_low_depth(df: pd.DataFrame, d: float) -> pd.DataFrame:
    """Setting a 20x avg depth cutoff

    This is reported by the lab as well, but this should catch samples
    that are borderline, that were accidentally included despite low
    coverage, or that are from outside collaborators."""
    df_depth = df.loc[df["average depth"] < d].copy()
    if df_depth.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_depth.loc[:, "exclude_reason"] = "low_depth"
        return df_depth[["sample", "exclude_reason"]]


def read_in_verifybamid(verify: str) -> pd.DataFrame:
    """Requires single *.selfSM files to be concatenated without headers"""
    return pd.read_csv(
        verify,
        sep="\t",
        header=None,
        names=[
            "#SEQ_ID",
            "RG",
            "CHIP_ID",
            "#SNPS",
            "#READS",
            "AVG_DP",
            "FREEMIX",
            "FREELK1",
            "FREELK0",
            "FREE_RH",
            "FREE_RA",
            "CHIPMIX",
            "CHIPLK1",
            "CHIPLK0",
            "CHIP_RH",
            "CHIP_RA",
            "DPREF",
            "RDPHET",
            "RDPALT",
        ],
    )


def exclude_contam(df_v: pd.DataFrame, c: float) -> pd.DataFrame:
    """Setting a 3% contamination threshold"""
    df_contam = df_v.loc[df_v["FREEMIX"] > c].copy()
    if df_contam.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_contam.loc[:, "exclude_reason"] = "contamination"
        t = df_contam["#SEQ_ID"].str.split(":", expand=True)
        if len(t.columns) == 2:
            df_contam["sample"] = t[1]
        else:
            # if you're running a single sample, verifyBamID doesn't use the sample_self:sample format in this field
            df_contam["sample"] = t[0]
        return df_contam[["sample", "exclude_reason"]]


def combine_exclusions(df_list: list) -> pd.DataFrame:
    """"""
    all_df = pd.concat(df_list).groupby("sample")["exclude_reason"].apply(",".join).reset_index()
    return all_df


if __name__ == "__main__":
    bstats, outfile, verify, r, d, c = get_args()
    df = read_in_bcftools_PSC(bstats)
    exclude1 = exclude_high_het_hom(df, r)
    exclude2 = exclude_low_depth(df, d)
    if verify:
        df_v = read_in_verifybamid(verify)
        exclude3 = exclude_contam(df_v, c)
        exclude_df = combine_exclusions([exclude1, exclude2, exclude3])
    else:
        exclude_df = combine_exclusions([exclude1, exclude2])

    exclude_df.to_csv(outfile + "_with_annotation.tsv", sep="\t", header=None, index=None)
    exclude_df[["sample"]].to_csv(outfile + ".tsv", sep="\t", header=None, index=None)
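
For illustration, with hypothetical sample names and output prefix, the two outputs would look roughly like:

    # <prefix>_with_annotation.tsv -- excluded sample plus comma-joined reasons, tab-separated, no header
    sampleA    low_depth
    sampleB    high_het_hom,contamination
    # <prefix>.tsv -- excluded sample names only
    sampleA
    sampleB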
generate_ped.py: build a plink-format ped file from a sample/sex linker tsv, or pass through an existing ped:
import argparse as ap
import re
import shutil
import sys

import pandas as pd


def get_args() -> list:
    """Get command line arguments"""
    parser = ap.ArgumentParser(
        description="Creates a plink-formatted ped from a tsv with subject and sex"
    )
    parser.add_argument("infile", type=str, help="linker filename")
    parser.add_argument(
        "outprefix", type=str, help="output filename prefix (.ped extension will be added)"
    )
    results = parser.parse_args()
    return (results.infile, results.outprefix)


def check_if_ped(infile: str) -> bool:
    """Look for .ped file extension

    This script can either be provided a tsv, which it will
    convert to a plink-style ped, or it can be provided a ped
    which it will simply copy to the expected output name/
    location.
    """
    if re.search(".ped$", infile, re.IGNORECASE):
        return True
    else:
        return False


def read_in_linker(infile: str) -> pd.DataFrame:
    """Read tsv linker file into a dataframe"""
    df = pd.read_table(infile, sep="\t")
    return df


def check_column_number(df: pd.DataFrame):
    """Check for two and only two columns in input tsv"""
    if len(df.columns) != 2:
        raise


def check_column_headers(df: pd.DataFrame):
    """Check for required headers in input tsv"""
    if ["Sample", "Sex"] != list(df.columns)[0:2]:
        raise


def add_ped_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add dummy columns to conform to plink ped format

    Should end up with a dataframe like so:
    Sample Sample 0 0 [F|M] -9
    """
    df.insert(1, "a", df.loc[:, "Sample"])
    df.insert(2, "b", 0)
    df.insert(2, "c", 0)
    df["d"] = "-9"
    return df


def check_sex_values(df: pd.DataFrame) -> pd.DataFrame:
    """Check for allowed values for sex information

    Input tsv may only have (case-insensitive) f, female,
    m, or male to encode sex.  Missing data can be encoded
    by one of the standard NA values that pandas can deal
    with automatically.  The error message emitted when this
    error is raised lists out the allowed values.
    """
    df["Sex"] = df["Sex"].str.lower()
    if not all(df[~df["Sex"].isna()].isin(["f", "female", "m", "male"])["Sex"]):
        raise


def encode_sex(df: pd.DataFrame) -> pd.DataFrame:
    """Convert sex information to ped encoding

    Missing sex information is converted to 0; male and
    female are converted to 1 and 2 respectively.
    """
    df["Sex"] = df["Sex"].str.lower()
    df["Sex"] = df["Sex"].replace(["f", "female", "m", "male"], [2, 2, 1, 1])
    df = df.fillna(0)
    df["Sex"] = df["Sex"].astype(int)
    return df


if __name__ == "__main__":
    infile, outprefix = get_args()
    if check_if_ped(infile):
        shutil.copyfile(infile, outprefix + ".ped")
        sys.exit(0)
    df = read_in_linker(infile)
    try:
        check_column_number(df)
    except Exception:
        sys.exit("Error: {} is required to have exactly two columns.".format(infile))
    try:
        check_column_headers(df)
    except Exception:
        sys.exit(
            "Error: Correct headers not detected in {} (requires 'Sample' and 'Sex', in that order).".format(
                infile
            )
        )
    try:
        check_sex_values(df)
    except Exception:
        sys.exit(
            "Error: Sex reported in {} must be represented by f, female, m, or male (case insensitive).  Missing data can be represented by any of the following strings: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', or 'null'.".format(
                infile
            )
        )

    df = add_ped_columns(df)
    df = encode_sex(df)
    df.to_csv(outprefix + ".ped", sep="\t", header=False, index=False)
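
As a worked example with hypothetical sample names, a minimal linker tsv and the ped rows it produces:

    # input linker tsv (tab-separated)
    Sample   Sex
    sampleA  F
    sampleB  male

    # resulting <outprefix>.ped
    sampleA  sampleA  0  0  2  -9
    sampleB  sampleB  0  0  1  -9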
Initial run summary report chunks:
## Load required R packages

library(knitr, quietly = TRUE)
library(kableExtra, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(RColorBrewer, quietly = TRUE)
## Load tested functions for this report
source(snakemake@input[["r_resources"]])
## Configure standard theme information for ggplot2

my.theme <- theme_light() + theme(plot.title = element_text(size = 15, hjust = 0.5),
                                  axis.title = element_text(size = 14),
                                  axis.text = element_text(size = 12),
                                  strip.background = element_blank(),
                                  strip.text = element_text(size = 14, colour = "black"))
# report inputs passed in from snakemake
input.subjects <- snakemake@params[["input_samples"]]
fastqc.filename <- snakemake@input[["fastqc"]]
tsv.path <- snakemake@params[["out_prefix"]]
start.time.filename <- snakemake@input[["start_time"]]
# build and display the subject tracking table, then write it out
res <- prepare.initial.subject.tracking.table(input.subjects)
res <- add.fastqc.data(res, fastqc.filename)
res <- res[order(res[, "Subject"]), ]
rownames(res) <- NULL

knitr::kable(res, caption = "Summary of Subject Fate") %>% kableExtra::kable_styling("striped", position = "left", full_width = FALSE)

write.output.table(res, tsv.path)
# compute elapsed run time
start.time <- read.table(start.time.filename, header = FALSE)[1,1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)
sessionInfo()
Full run summary report chunks (subject tracking, sex discordance, and relatedness checks):
## Load required R packages

library(knitr, quietly = TRUE)
library(kableExtra, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(RColorBrewer, quietly = TRUE)
## Load tested functions for this report
source(snakemake@input[["r_resources"]])
## Configure standard theme information for ggplot2

my.theme <- theme_light() + theme(plot.title = element_text(size = 15, hjust = 0.5),
                                  axis.title = element_text(size = 14),
                                  axis.text = element_text(size = 12),
                                  strip.background = element_blank(),
                                  strip.text = element_text(size = 14, colour = "black"))
# report inputs passed in from snakemake
input.subjects <- snakemake@params[["input_samples"]]
exclude.reasons.filename <- snakemake@input[["exclude_list"]]
output.subjects.filename <- snakemake@input[["output_subject_list"]]
somalier.run <- snakemake@params[["somalier"]]
somalier.relatedness.filename <- snakemake@input[["relatedness"]]
guess.ploidy.filename <- snakemake@input[["sex"]]
sex.linker.filename <- snakemake@input[["ped_file"]]
fastqc.filename <- snakemake@input[["fastqc"]]
bcftools.stats.filename <- snakemake@input[["bcftools_stats"]]
tsv.path <- snakemake@params[["out_prefix"]]
start.time.filename <- snakemake@input[["start_time"]]
run.mode <- snakemake@params[["run_mode"]]
# somalier relatedness thresholds
somalier.related.min <- 0.5
somalier.duplicate.min <- 0.95
# build and display the subject tracking table, then write it out
res <- prepare.subject.tracking.table(input.subjects, output.subjects.filename, exclude.reasons.filename)

if (!run.mode == "jointgeno"){
  res <- add.fastqc.data(res, fastqc.filename)
}

res <- add.coverage(res, bcftools.stats.filename)
res <- res[order(res[, "Subject"]), ]
rownames(res) <- NULL

knitr::kable(res, caption = "Summary of Subject Fate") %>% kableExtra::kable_styling("striped", position = "left", full_width = FALSE)

write.output.table(res, tsv.path)
# report sex-discordant subjects
res <- report.sex.discordances(sex.linker.filename, guess.ploidy.filename)

if (nrow(res) == 0) {
    cat("No sex discordant subjects were identified.")
} else {
    print(knitr::kable(res, caption = "Sex Discordant Subjects") %>% kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}
# report genetic duplicates above the duplicate relatedness cutoff
res <- report.related.subject.pairs(output.subjects.filename, somalier.relatedness.filename, somalier.duplicate.min, 1.0)

if (nrow(res) == 0) {
    cat("No duplicate subjects were detected above the specified relatedness cutoff.")
} else {
    print(knitr::kable(res, caption = "Genetic Duplicates") %>% kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}
# report related, non-identical subject pairs
res <- report.related.subject.pairs(output.subjects.filename, somalier.relatedness.filename, somalier.related.min, somalier.duplicate.min)

if (nrow(res) == 0) {
    cat("No non-identical related subjects were detected within the specified range.")
} else {
    print(knitr::kable(res, caption = "Potentially Related Non-identical Subjects") %>% kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}
# compute elapsed run time
start.time <- read.table(start.time.filename, header = FALSE)[1,1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)
sessionInfo()
URL: https://github.com/bballew/54gene-wgs-germline
Name: 54gene-wgs-germline
Version: 1.0.0
Copyright: Public Domain
License: MIT License