Snakemake workflow for evaluating varlociraptor

public public 1yr ago 0 bookmarks

This Snakemake workflow automatically generates all results and figures from the initial Varlociraptor paper focusing on somatic variant calling .

Rerunning the workflow requires a lot of computation time and some unavoidable external resources that have to be manually deployed . We therefore hope that the Snakemake report in the supplementary material of the paper , providing all results together with comprehensive provenance information (workflow steps, parameters, software versions, code) will already yield sufficient information in most of the cases. If you nevertheless intend to rerun the analysis, feel free to follow the steps below, and please inform us about any potential issues.

General Requirements

Any 64-bit Linux installation with GLIBC 2.5 or newer (i.e. any Linux distribution that is newer than CentOS 6). Note that the restriction of this workflow to Linux is purely a design decision (to save space and ensure reproducibility) and not related to Conda/Bioconda. Bioconda packages are available for both Linux and MacOS in general.

Usage

This workflow can be used to recreate all results found in the paper.

Step 1: Setup system

Variant a: Installing Miniconda on your system

If you are on a Linux system with GLIBC 2.5 or newer (i.e. any Linux distribution that is newer than CentOS 6), you can simply install Miniconda3 with

curl -o /tmp/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash /tmp/miniconda.sh

Make sure to answer yes to the question whether your PATH variable shall be modified. Afterwards, open a new shell/terminal.

Variant b: Use a Docker container

Otherwise, e.g., on MacOS or if you don't want to modify your system setup, install Docker , run

docker run -it continuumio/miniconda3 /bin/bash

and execute all the following steps within that container.

Variant c: Use an existing Miniconda installation

If you want to use an existing Miniconda installation, please be aware that this is only possible if it uses Python 3 by default. You can check this via

python --version

Further, ensure it is up to date with

conda update --all

Step 2: Setup Bioconda channel

Setup Bioconda with

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Step 3: Install Snakemake

Install Snakemake >=5.4.0 with

conda install snakemake

If you already have an older version of Snakemake, please make sure it is updated to >=5.4.0.

Step 4: Download the workflow

First, create a working directory:

mkdir varlociraptor-workflow
cd varlociraptor-workflow

Then, download the workflow archive from https://doi.org/10.5281/zenodo.3361700 and unpack it with

tar -xf workflow.tar.gz

Step 5: Obtain additional resources

In this special case there are unfortunately unavoidable additional requirements, due to licensing restrictions and data size.

  1. The required real data has to be obtained from EGA (EGAD00001002142) . After downloading it, edit the config.yaml in order to point to the right paths here and below .

  2. The required simulated data has to be obtained from Zenodo: https://doi.org/10.5281/zenodo.1421298. Download it, convert back to BAM, and edit the config.yaml in order to point to the right paths here and here .

  3. The synthetic data can be obtained by running a separate workflow: https://doi.org/10.5281/zenodo.3630241.

Once all data is obtained, edit the config.yaml under datasets: to point to the right file paths in your system.

Step 6: Run the workflow

Execute the analysis workflow with Snakemake

snakemake --use-conda

Please wait a few minutes for the analysis to finish. Results can be found in the folder figs/ . If you have been running the workflow in the docker container (see above), you can obtain the results with

docker cp <container-id>:/bioconda-workflow/figs .

whith <container-id> being the ID of the container.

Known errors

  • If you see an error like

    ImportError: No module named 'appdirs'
    

    when starting Snakemake, you are likely suffering from a bug in an older conda version. Make sure to update your conda installation with

    conda update --all
    

    and then reinstall the appdirs and snakemake package with

    conda install -f appdirs snakemake
    
  • If you see an error like

    ImportError: Missing required dependencies ['numpy']
    

    you are likely suffering from a bug in an older conda version. Make sure to update your conda installation with

    conda update --all
    

    and then reinstall the snakemake package with

    conda install -f snakemake
    

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools concat {snakemake.params} -o {snakemake.output[0]} "
    "{snakemake.input}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


try:
    exclude = "-x " + snakemake.input.exclude
except AttributeError:
    exclude = ""


extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "OMP_NUM_THREADS={snakemake.threads} delly call {extra} "
    "{exclude} -t {snakemake.params.vartype} -g {snakemake.input.ref} "
    "-o {snakemake.output[0]} {snakemake.input.samples} {log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools concat {snakemake.params} -o {snakemake.output[0]} "
    "{snakemake.input}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools view {snakemake.params} {snakemake.input[0]} "
    "-o {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools concat {snakemake.params} -o {snakemake.output[0]} "
    "{snakemake.input.calls}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell(
    "bcftools view {snakemake.params} {snakemake.input[0]} "
    "-o {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("picard MarkDuplicates {snakemake.params} INPUT={snakemake.input} "
      "OUTPUT={snakemake.output.bam} METRICS_FILE={snakemake.output.metrics} "
      "&> {snakemake.log}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell


prefix = os.path.splitext(snakemake.output[0])[0]

shell(
    "samtools sort {snakemake.params} -@ {snakemake.threads} -o {snakemake.output[0]} "
    "-T {prefix} {snakemake.input[0]}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from snakemake.shell import shell


extra = snakemake.params.get("extra", "")
region = snakemake.params.get("region", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)


shell("samtools stats {extra} {snakemake.input}"
      " {region} > {snakemake.output} {log}")
 9
10
script:
    "../scripts/adhoc-calling.py"
14
15
16
shell:
    "(break-point-inspector -vcf {input.manta} -ref {input.samples[1]} -tumor {input.samples[0]} "
    "-output_vcf {output}) > {log} 2>&1"
SnakeMake From line 14 of rules/bpi.smk
29
30
wrapper:
    "0.22.0/bio/bcftools/view"
SnakeMake From line 29 of rules/bpi.smk
40
41
wrapper:
    "0.22.0/bio/bcftools/view"
SnakeMake From line 40 of rules/bpi.smk
16
17
wrapper:
    "0.17.4/bio/delly"
27
28
wrapper:
    "0.17.4/bio/bcftools/concat"
34
35
36
37
run:
    with open(output[0], "w") as out:
        print("tumor", "tumor", file=out)
        print("normal", "control", file=out)
54
55
56
57
shell:
    "delly filter -m 0 -r 1.0 --samples {input.samples} "
    "-o {params.tmp} {input.bcf}; "
    "bcftools view -i INFO/SOMATIC -f PASS -Ob {params.tmp} > {output}"
14
15
script:
    "../scripts/annotate-truth.py"
SnakeMake From line 14 of rules/eval.smk
43
44
shell:
    "bcftools view {params.regions} {input.calls} | rbt vcf-match {params.match} {input.truth} > {output}"
59
60
shell:
    "bcftools view {params.regions} {input.calls} | rbt vcf-match {params.match} {input.truth} > {output}"
70
71
shell:
    "rbt vcf-to-txt --genotypes --info SOMATIC SVLEN SVTYPE TAF NAF < {input} > {output}"
108
109
shell:
    "rbt vcf-to-txt {params.gt} {params.tags} --info MATCHING < {input} > {output}"
122
123
shell:
    "rbt vcf-to-txt {params.gt} {params.tags} < {input} > {output}"
136
137
shell:
    "rbt vcf-to-txt {params.gt} {params.tags} --info MATCHING < {input} > {output}"
154
155
script:
    "../scripts/obtain-tp-fp.py"
SnakeMake From line 154 of rules/eval.smk
211
212
script:
    "../scripts/plot-precision-recall.py"
SnakeMake From line 211 of rules/eval.smk
228
229
script:
    "../scripts/plot-allelefreq-recall.py"
SnakeMake From line 228 of rules/eval.smk
245
246
script:
    "../scripts/plot-fdr-control.py"
SnakeMake From line 245 of rules/eval.smk
260
261
script:
    "../scripts/plot-allelefreq-estimation.py"
SnakeMake From line 260 of rules/eval.smk
275
276
script:
    "../scripts/plot-allelefreq-scatter.py"
SnakeMake From line 275 of rules/eval.smk
289
290
script:
    "../scripts/plot-score-dist.py"
SnakeMake From line 289 of rules/eval.smk
300
301
script:
    "../scripts/bam-stats.py"
SnakeMake From line 300 of rules/eval.smk
315
316
shell:
    "rbt vcf-match {params.match} {params.bcfs[1]} < {params.bcfs[0]} > {output}"
336
337
script:
    "../scripts/aggregate-concordance.py"
SnakeMake From line 336 of rules/eval.smk
350
351
shell:
    "rbt vcf-to-txt {params.gt} {params.tags} --info MATCHING < {input} > {output}"
364
365
script:
    "../scripts/plot-concordance-upset.R"
SnakeMake From line 364 of rules/eval.smk
382
383
script:
    "../scripts/plot-concordance.py"
SnakeMake From line 382 of rules/eval.smk
393
394
script:
    "../scripts/plot-freqdist.py"
SnakeMake From line 393 of rules/eval.smk
 8
 9
10
11
12
13
14
15
16
17
18
shell:
    """
    cd resources
    curl -L https://github.com/nygenome/lancet/archive/{params.commit}.tar.gz | tar -xz
    cd lancet-{params.commit}
    sed -i 's/-Wl,-rpath,$(ABS_BAMTOOLS_DIR)\/lib\///' Makefile
    make clean
    prefix=$(dirname $(dirname $(which $CXX)))
    make CXX=$CXX INCLUDES="-I$prefix/include/ -I$prefix/include/bamtools -L$prefix/lib"
    cp lancet ..
    """
47
48
49
50
51
shell:
   "LD_LIBRARY_PATH=$CONDA_PREFIX/lib "
   "resources/lancet --tumor {input.bams[0]} --normal {input.bams[1]} "
   "--ref {input.ref} --reg {params.region} --num-threads {threads} "
   "{params.extra} > {output} 2> {log}"
62
63
shell:
    "sed -r 's/MS\=[0-9]+[ACGT]+/MS/g' {input.vcf} | bcftools annotate -o {output} -h {input.header} -"
73
74
shell:
    "bcftools concat -Ob {input} > {output}"
87
88
wrapper:
    "0.19.3/bio/bcftools/view"
20
21
22
23
24
shell:
    "rm -rf {params.dir}; "
    "(configManta.py {params.extra} --tumorBam {input.samples[0]} --normalBam {input.samples[1]} "
    "--referenceFasta {input.ref} --runDir {params.dir}; "
    "{params.dir}/runWorkflow.py -m local -j {threads}) > {log} 2>&1"
34
35
wrapper:
    "0.22.0/bio/bcftools/view"
45
46
wrapper:
    "0.22.0/bio/bcftools/view"
59
60
wrapper:
    "0.22.0/bio/bcftools/view"
15
16
wrapper:
    "0.22.0/bio/samtools/sort"
28
29
shell:
    "samtools bam2fq {input} -1 {output.m1} -2 {output.m2} -0 {output.mixed}"
41
42
shell:
    "bwa index -p {params.prefix} {input}"
72
73
74
shell:
    "(resources/bwa mem -t {threads} {params.extra} {params.index} {input.sample} | "
    "samtools view -Sb - > {output}) 2> {log}"
85
86
wrapper:
    "0.22.0/bio/samtools/sort"
 99
100
wrapper:
    "0.22.0/bio/picard/markduplicates"
108
109
wrapper:
    "0.22.0/bio/samtools/index"
8
9
shell:
    "faidx --transform bed {input} > {output}"
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
shell:
    """
    (preprocess.py --mode call --reference {input.ref} \
                  --region_bed {input.bed} --tumor_bam {input.bams[0]} \
                  --normal_bam {input.bams[1]} --work {output.workdir} \
                  --min_mapq 10 --num_threads {threads} \
                  --scan_alignments_binary /opt/neusomatic/neusomatic/bin/scan_alignments

    call.py --candidates_tsv {output.workdir}/dataset/*/candidates*.tsv \
            --reference {input.ref} --out {output.workdir} \
            --checkpoint /opt/neusomatic/neusomatic/models/NeuSomatic_v0.1.4_standalone_SEQC-WGS-Spike.pth \
            --num_threads {threads} \
            --batch_size 100

    python `which postprocess.py` --reference {input.ref} --tumor_bam {input.bams[0]} \
                   --pred_vcf {output.workdir}/pred.vcf \
                   --candidates_vcf {output.workdir}/work_tumor/filtered_candidates.vcf \
                   --output_vcf {output.vcf} \
                   --work {output.workdir}) 2> {log}
    """
59
60
shell:
    "bcftools annotate -Ob -o {output} -h {input.header} {input.vcf}"
74
75
wrapper:
    "0.19.3/bio/bcftools/view"
15
16
shell:
    "run_novobreak {input.ref} {input.bams[0]} {input.bams[1]} {threads} {output.workdir} > {log} 2>&1"
26
27
shell:
    "bcftools view -Ob {input}/novoBreak.pass.flt.vcf > {output}"
37
38
shell:
    "bcftools concat -Ob {input}/split/*.sp.vcf > {output}"
11
12
wrapper:
    "0.22.0/bio/samtools/stats"
24
25
shell:
    "samtools depth {input} | gzip > {output}"
20
21
22
23
24
shell:
    "rm -rf {params.dir}; "
    "(configureStrelkaSomaticWorkflow.py {params.extra} --tumorBam {input.samples[0]} --normalBam {input.samples[1]} "
    "--referenceFasta {input.ref} --runDir {params.dir} --indelCandidates {input.manta}; "
    "{params.dir}/runWorkflow.py -m local -j {threads}) > {log} 2>&1"
35
36
wrapper:
    "0.22.0/bio/bcftools/concat"
49
50
wrapper:
    "0.22.0/bio/bcftools/view"
8
9
shell:
    "cairosvg {input} -o {output}"
39
40
41
42
43
44
45
shell:
    "bcftools view -Ou {input.calls} {params.chrom_prefix} | "
    "varlociraptor call variants {input.ref} "
    "{config[caller][varlociraptor][params]} {params.caller} "
    "tumor-normal {input.bams} "
    "--purity {params.purity} "
    "> {output} 2> {log}"
55
56
wrapper:
    "0.19.1/bio/bcftools/concat"
66
67
shell:
    "varlociraptor filter-calls posterior-odds positive --events SOMATIC_TUMOR < {input} > {output}"
77
78
79
80
shell:
    "varlociraptor filter-calls control-fdr {input} --events SOMATIC_TUMOR --var {wildcards.type} "
    "--minlen {wildcards.minlen} --maxlen {wildcards.maxlen} "
    "--fdr {wildcards.fdr} > {output}"
92
93
wrapper:
    "0.22.0/bio/bcftools/view"
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
from cyvcf2 import VCF, Writer
import numpy as np

def get_sample_name(tissue):
    ds = snakemake.config["runs"][snakemake.wildcards.run]["dataset"]
    return snakemake.config["datasets"][ds][tissue]["name"]


tumor, normal = map(get_sample_name, ["tumor", "normal"])


bcf_in = VCF(snakemake.input.vcf)
bcf_out = Writer(snakemake.output[0], bcf_in)


for rec in bcf_in:
    if rec.FILTER:
        continue
    gt = rec.genotypes
    tumor_gt = gt[0][:2]
    normal_gt = gt[1][:2]
    if (np.any(tumor_gt) and
        not np.any(normal_gt) and
        not np.any(np.isnan(normal_gt))):
        # somatic variant
        bcf_out.write_record(rec)

bcf_out.close()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
from common import load_variants
import networkx as nx
import pandas as pd
import numpy as np

vartype = snakemake.wildcards.vartype

index_cols = ["CHROM", "POS", "SVLEN"] if vartype == "INS" or vartype == "DEL" else ["CHROM", "POS", "ALT"]

all_variants = [load_variants(f, vartype=vartype) for f in snakemake.input.calls]

G = nx.Graph()
for calls, (i, j) in zip(all_variants, snakemake.params.dataset_combinations):
    calls["component"] = None
    for call in calls.itertuples():
        a = (i, call.Index)
        G.add_node(a)
        if call.MATCHING >= 0:
            b = (j, call.MATCHING)
            G.add_node(b)
            G.add_edge(a, b)

# get a set of calls for each dataset (we don't need all pairwise comparisons for that)
representatives = {snakemake.params.dataset_combinations[i][0]: calls for i, calls in enumerate(all_variants)}

if snakemake.wildcards.mode != "varlociraptor":
    varlociraptor_variants = [load_variants(f, vartype=vartype) for f in snakemake.input.varlociraptor_calls]
    for calls in varlociraptor_variants:
        calls.set_index(index_cols, inplace=True)
    varlociraptor_representatives = {snakemake.params.dataset_combinations[i][0]: calls for i, calls in enumerate(varlociraptor_variants)}

# annotate calls with their component, i.e. their equivalence class
for component_id, component in enumerate(nx.connected_components(G)):
    for i, k in component:
        representatives[i].loc[k, "component"] = component_id
for calls in representatives.values():
    calls["component"] = calls["component"].astype(np.float32)
    calls.set_index("component", inplace=True)

# join calls based on their equivalence class
aggregated = None
suffix = "_{}".format
dataset_name = lambda i: snakemake.params.datasets[i]
is_varlociraptor = False
for dataset_id, calls in representatives.items():
    cols = list(index_cols)
    if "CASE_AF" in calls.columns:
        cols.extend(["CASE_AF", "PROB_SOMATIC_TUMOR"])
        is_varlociraptor = True
    calls = calls[cols]
    if snakemake.wildcards.mode != "varlociraptor":
        idx_calls = calls.set_index(cols, drop=False)
        caseaf = idx_calls.join(varlociraptor_representatives[dataset_id][["CASE_AF"]], how="left")["CASE_AF"]
        caseaf = caseaf[~caseaf.index.duplicated()]
        calls = calls[~idx_calls.index.duplicated()]
        calls["CASE_AF"] = caseaf.values

    calls.columns = [c + suffix(dataset_name(dataset_id)) for c in calls.columns]
    if aggregated is None:
        aggregated = calls
    else:
        aggregated = aggregated.join(calls, how="outer", lsuffix="", rsuffix="")

# Forget the component id. Otherwise, we might run into errors with duplicate elements
# in the index below. These can occur if there are multiple ambiguous calls.
aggregated.reset_index(inplace=True, drop=True)

pos_cols = aggregated.columns[aggregated.columns.str.startswith("POS_")]
is_called = (~aggregated[pos_cols].isnull()).astype(int)
is_called.columns = pos_cols.str.replace("POS_", "")
aggregated = aggregated.join(is_called, lsuffix="", rsuffix="")

aggregated.insert(len(aggregated.columns), "concordance_count", is_called.sum(axis=1))

aggregated["max_case_af"] = aggregated[aggregated.columns[aggregated.columns.str.startswith("CASE_AF")]].max(axis=1)
if is_varlociraptor:
    aggregated["max_prob_somatic_tumor"] =  aggregated[aggregated.columns[aggregated.columns.str.startswith("PROB_SOMATIC")]].min(axis=1)

aggregated.to_csv(snakemake.output[0], sep="\t", index=False)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
from cyvcf2 import VCF, Writer
import numpy as np


def subclone_vaf(gt):
    """Calculate subclone allele frequency"""
    if np.all(gt[:2] == [1, 1]):
        return 1.0
    elif (np.all(gt[:2] == [0, 1]) or np.all(gt[:2] == [1, 0]) or
          np.all(gt[:2] == [-1, 1]) or np.all(gt[:2] == [1, -1])):
        return 0.5
    else:
        return 0.0


# Reader
vcf_in = VCF(snakemake.input[0])

# Setup subclone information
subclones = ["Som{}".format(i) for i in range(1, 5)]
fractions = [1/3, 1/3, 1/4, 1/12]


# Prepare writer
vcf_in.add_info_to_header({"ID": "TAF",
                           "Number": "1",
                           "Description": "True tumor allele frequency",
                           "Type": "Float"})
vcf_in.add_info_to_header({"ID": "NAF",
                           "Number": "1",
                           "Description": "True normal allele frequency",
                           "Type": "Float"})
bcf_out = Writer(snakemake.output[0], vcf_in)

for rec in vcf_in:
    if len(rec.ALT) > 1:
        raise ValueError("multiallelic sites are not supported at the moment")

    try:
        # get VAFs from VCF
        tumor_vaf = rec.INFO["TAF"]
        normal_vaf = rec.INFO["NAF"]
    except KeyError:
        # calculate VAFs
        subclone_idx = [vcf_in.samples.index(s) for s in subclones]
        control_idx = vcf_in.samples.index("Control")

        tumor_vaf = sum(fraction * subclone_vaf(rec.genotypes[idx])
                        for idx, fraction in zip(subclone_idx, fractions))
        normal_vaf = subclone_vaf(rec.genotypes[control_idx])

        rec.INFO["TAF"] = tumor_vaf
        rec.INFO["NAF"] = normal_vaf

    # only keep somatic variants
    if normal_vaf == 0.0 and tumor_vaf > 0.0:
        bcf_out.write_record(rec)

bcf_out.close()
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
import sys
import numpy as np
import pandas as pd
import pysam
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import seaborn as sns
from functools import partial

tumor = pysam.AlignmentFile(snakemake.input[0], "rb")
normal = pysam.AlignmentFile(snakemake.input[1], "rb")

softclips = []

for i, rec in enumerate(normal):
    if rec.is_supplementary or rec.is_unmapped:
        continue
    is_first_read = rec.pos < rec.mpos
    get_clip = lambda c: c[1] if c[0] == 4 else None
    clip_left = get_clip(rec.cigartuples[0])
    if clip_left is not None:
        softclips.append([clip_left, True, is_first_read])
    clip_right = get_clip(rec.cigartuples[-1])
    if clip_right is not None:
        softclips.append([clip_right, False, is_first_read])
    if i == 10000000:
        break

softclips = pd.DataFrame(softclips, columns=["len", "left", "first_in_pair"])

def plot(*args, **kwargs):
    softclips = args[0]
    plt.hist(softclips, normed=True)
    q95 = np.percentile(softclips, 99)
    plt.plot([q95, q95], [0, 1.0], "--k")
    m = max(softclips)
    plt.plot([m, m], [0, 1.0], ":k")
    plt.text(m, 1, "max={}".format(m), horizontalalignment="right", verticalalignment="top")


g = sns.FacetGrid(softclips, col="left", row="first_in_pair")
g = g.map(plot, "len")

plt.savefig(snakemake.output[0])
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import pandas as pd
import numpy as np
from common import load_variants


minlen = int(snakemake.wildcards.minlen)
maxlen = int(snakemake.wildcards.maxlen)
vartype = snakemake.wildcards.vartype

if snakemake.wildcards.mode == "varlociraptor":
    score = snakemake.config["caller"]["varlociraptor"]["score"]
    # calls are already filtered by FDR control step
    minlen = None
    maxlen = None
elif snakemake.wildcards.mode == "default":
    score = snakemake.config["caller"][snakemake.wildcards.caller]["score"]
else:
    score = None

calls = load_variants(snakemake.input.calls, vartype=vartype, minlen=minlen, maxlen=maxlen)

calls["is_tp"] = calls["MATCHING"] >= 0

calls["score"] = calls[score] if score else np.nan

calls.to_csv(snakemake.output[0], sep="\t")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
from itertools import product
import math
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np


MIN_CALLS = 10

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)

truth = common.load_variants(snakemake.input.truth, vartype=vartype)

def props(callers):
    return product(callers, snakemake.params.len_ranges)

def plot_len_range(minlen, maxlen):
    def plot(calls, colors):
        calls = calls[calls.is_tp]
        true_af = truth.loc[calls.MATCHING].reset_index().TAF
        calls = calls.reset_index()
        calls["error"] = calls.CASE_AF - true_af

        if calls.empty:
            return

        calls["true_af"] = true_af
        true_af = pd.Series(calls["true_af"].unique()).sort_values()
        # standard deviation when sampling in binomial process from allele freq
        # this is the expected sampling error within the correctly mapped fragments
        # sd = true_af.apply(lambda af: 1 / 40 * math.sqrt(40 * af * (1 - af)))
        # x = np.arange(len(true_af))
        # offsets = [-0.5, 0.5]
        # y_upper = np.array([v for v in sd for o in offsets])
        #  y_lower = np.maximum(-y_upper, [-f for f in true_af for o in offsets])
        # plt.fill_between([v + o for v in x for o in offsets], y_lower, y_upper, color="#EEEEEE", zorder=-5)

        calls["true_af"] = calls["true_af"].apply("{:.3f}".format)

        size = 1 if maxlen == 30 else 2
        sns.stripplot("true_af", "error", hue="caller", data=calls, palette=colors, dodge=True, jitter=True, alpha=0.5, size=size, rasterized=True)
        sns.boxplot("true_af", "error", hue="caller", data=calls, color="white", fliersize=0, linewidth=1)

        handles, labels = plt.gca().get_legend_handles_labels()
        n = len(calls.caller.unique())

        plt.ylim((-1,1))
        plt.grid(axis="y", linestyle=":", color="grey")
        sns.despine()
        plt.xticks(rotation="vertical")
        ax = plt.gca()
        ax.legend().remove()

        return ax, handles[n:]

    all_calls, all_colors = load_calls(minlen, maxlen)
    return plot(all_calls, all_colors)

def load_calls(minlen, maxlen):
    all_calls = []
    all_colors = []
    for calls, (caller, len_range) in zip(snakemake.input.varlociraptor_calls, props(snakemake.params.varlociraptor_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        label = "varlociraptor+{}".format(caller)
        calls = pd.read_table(calls)
        calls["caller"] = label
        if not calls.empty:
            all_calls.append(calls)
            all_colors.append(colors[caller])

    all_calls = pd.concat(all_calls)
    return all_calls, all_colors

common.plot_ranges(
    snakemake.params.len_ranges,
    plot_len_range,
    xlabel="true allele frequency",
    ylabel="predicted - truth")

plt.savefig(snakemake.output[0], bbox_inches="tight")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
from itertools import product
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np
import math
from matplotlib.lines import Line2D

MIN_CALLS = 10

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)


def props(callers):
    return product(callers, snakemake.params.len_ranges)


def plot_len_range(minlen, maxlen):

    truth = common.load_variants(
        snakemake.input.truth, minlen, maxlen, vartype=vartype)

    afs = pd.Series(truth.TAF.unique()).sort_values()

    def plot(calls,
             label,
             color,
             varlociraptor=True,
             style="-.",
             markersize=4):
        calls = pd.read_table(calls, index_col=0)
        if len(calls) < 10:
            return
        if varlociraptor:
            phred = lambda p: -10 * math.log10(p)
            def calc_recall(p):
                c = calls[calls.score <= phred(p)]
                return [common.recall(c, truth[truth.TAF >= af]) for af in afs]

            return plt.fill_between(
                afs,
                calc_recall(0.98 if maxlen > 30 else 0.99),
                calc_recall(0.9),
                color=color,
                label=label,
                alpha=0.6)
        else:
            recall = [common.recall(calls, truth[truth.TAF >= af]) for af in afs]
            # plot a white background first to increase visibility
            plt.plot(afs, recall, "-", color="white", alpha=0.8)
            return plt.plot(
                afs,
                recall,
                style,
                color=color,
                label=label)[0]

    handles = []
    def register_handle(handle):
        if handle is not None:
            handles.append(handle)
    for calls, (caller,
                len_range) in zip(snakemake.input.varlociraptor_calls,
                                  props(snakemake.params.varlociraptor_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        label = "varlociraptor+{}".format(caller)
        handle = plot(calls, label, colors[caller], varlociraptor=True)
        register_handle(handle)
        #handles.append(Line2D([0], [0], color=colors[caller], label=label))

    for calls, (caller, len_range) in zip(snakemake.input.adhoc_calls,
                             props(snakemake.params.adhoc_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        color = colors[caller]
        handle = plot(calls, caller, color, style=":", varlociraptor=False)
        register_handle(handle)
        #handles.append(Line2D([0], [0], linestyle=":", color=color, label=caller))

    sns.despine()
    ax = plt.gca()
    return ax, handles


common.plot_ranges(
    snakemake.params.len_ranges,
    plot_len_range,
    xlabel="allele frequency",
    ylabel="recall")

plt.savefig(snakemake.output[0], bbox_inches="tight")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
import math
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np

MIN_COUNT = 20
MAX_DEPTH = 60

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)

truth = common.load_variants(snakemake.input.truth, vartype=vartype)

all_calls = []
for caller, calls in zip(snakemake.params.callers, snakemake.input.calls):
    calls = pd.read_table(calls)
    calls.loc[:, "caller"] = caller
    all_calls.append(calls)
all_calls = pd.concat(all_calls)

def plot(af, _):
    constrain_lower = lambda error: np.maximum(error, -af)
    constrain_upper = lambda error: np.minimum(error, 1.0 - af)

    dp = all_calls["TUMOR_DP"]
    calls = all_calls[all_calls.is_tp]
    true_af = truth.loc[calls.MATCHING].reset_index().TAF
    calls = calls.reset_index()
    calls["true_af"] = true_af
    calls = calls[calls["true_af"] == af]
    calls["error"] = calls.CASE_AF - true_af

    sns.kdeplot(calls["TUMOR_DP"], calls["error"], cmap="Blues", n_levels=50, shade=True, alpha=0.7, shade_lowest=False) #alpha=0.5, clip=((0.0, 1.0), (0.0, af)))
    plt.plot(calls["TUMOR_DP"], calls["error"], ",", color="k", lw=0, alpha=1.0, rasterized=True)
    by_depth = calls.groupby("TUMOR_DP")["error"].describe().reset_index()
    by_depth["-std"] = constrain_lower(-by_depth["std"])
    by_depth["std"] = constrain_upper(by_depth["std"])
    by_depth = by_depth[by_depth["count"] >= MIN_COUNT]
    plt.plot(by_depth.TUMOR_DP, by_depth["std"], "--", color="k")
    plt.plot(by_depth.TUMOR_DP, by_depth["-std"], "--", color="k")
    plt.plot(by_depth.TUMOR_DP, by_depth["mean"], "-", color="k")

    depths = np.arange(0, MAX_DEPTH)
    # standard deviation when sampling in binomial process from allele freq
    # this is the expected sampling error within the correctly mapped fragments
    sd = np.array([1.0 / depth * math.sqrt(depth * af * (1.0 - af)) for depth in depths])
    plt.fill_between(depths, constrain_lower(-sd), constrain_upper(sd), color="grey", alpha=0.5)

    sns.despine()
    plt.xticks(rotation="vertical")
    ax = plt.gca()
    ax.legend().remove()
    handles, labels = ax.get_legend_handles_labels()
    plt.ylim((-1.0, 1.0))
    plt.xlim((0, MAX_DEPTH))

    return ax, []

afs = [(af, af) for af in truth.TAF.sort_values().unique()]

common.plot_ranges(
    afs,
    plot,
    "depth",
    "predicted - truth")

plt.savefig(snakemake.output[0], bbox_inches="tight")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
from itertools import product
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np
import math
from matplotlib.lines import Line2D
from matplotlib.colors import to_rgba


class NotEnoughObservationsException(Exception):
    pass


MIN_CALLS = 20
MAX_LEN = 1000

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)



varlociraptor_calls_low = [pd.read_table(f) for f in snakemake.input.varlociraptor_calls_low]
varlociraptor_calls_high = [pd.read_table(f) for f in snakemake.input.varlociraptor_calls_high]
adhoc_calls = [pd.read_table(f) for f in snakemake.input.adhoc_calls]


def expected_count(af, effective_mutation_rate):
    """Calculate the expected number of somatic variants
       greater than a given allele frequency given an effective mutation
       rate, according to the model of Williams et al. Nature 
       Genetics 2016"""
    return effective_mutation_rate * (1.0 / af - 1.0)


def expected_counts(afs, effective_mutation_rate):
    return [expected_count(af, effective_mutation_rate) for af in afs]


def calc_concordance(calls):
    n = len(calls)
    return (calls["concordance_count"] > 1).sum() / n


def plot_len_range(minlen, maxlen, yfunc=None, yscale=None, upper_bound=None):
    handles_varlociraptor = []
    handles_adhoc = []
    for i, caller in enumerate(snakemake.params.callers):
        def plot_calls(calls, label, color, style, calls_lower=None):
            def get_xy(calls, caseafs=None):
                svlen = calls.loc[:, calls.columns.str.startswith("SVLEN")].abs()
                # at least one of the calls has a valid svlen
                valid = ((svlen >= minlen) & (svlen <= maxlen)).sum(axis=1) >= 1
                calls = calls[valid]
                if caseafs is None:
                    caseafs = calls["max_case_af"].dropna().unique()
                y = []
                _caseafs = []
                for caseaf in sorted(caseafs):
                    _calls = calls[calls["max_case_af"] >= caseaf]
                    if upper_bound is not None:
                        _calls = _calls[_calls["max_case_af"] <= caseaf + upper_bound]
                    if len(_calls) < MIN_CALLS:
                        continue
                    _caseafs.append(caseaf)
                    y.append(yfunc(_calls))
                return _caseafs, y

            x, y = get_xy(calls)
            if not x:
                raise NotEnoughObservationsException()
            if calls_lower is not None:
                _, y2 = get_xy(calls_lower, caseafs=x)
                return plt.fill_between(x, y, y2, label=label, edgecolor=color, facecolor=to_rgba(color, alpha=0.2))
            else:
                if style != "-":
                    plt.plot(x, y, "-", color="white", alpha=0.8)
                return plt.plot(x, y, style, label=label, color=color)[0]

        color = colors[snakemake.params.callers[i]]
        try:
            handles_varlociraptor.append(
                plot_calls(
                    varlociraptor_calls_high[i], 
                    "varlociraptor+{}".format(caller), 
                    color=color, style="-", 
                    calls_lower=varlociraptor_calls_low[i]))
        except NotEnoughObservationsException:
            # skip plot
            pass
        try:
            handles_adhoc.append(plot_calls(adhoc_calls[i], caller, color=color, style=":"))
        except NotEnoughObservationsException:
            # skip plot
            pass

    handles = handles_varlociraptor + handles_adhoc
    sns.despine()
    ax = plt.gca()
    if yscale is not None:
        ax.set_yscale(yscale)
    return ax, handles

plt.figure(figsize=(10, 4))
plt.subplot(121)
plot_len_range(1, MAX_LEN, yfunc=calc_concordance)
plt.xlabel("$\geq$ tumor allele frequency")
plt.ylabel("concordance")

plt.subplot(122)
for effective_mutation_rate in 10 ** np.linspace(1, 5, 7):
    afs = np.linspace(0.0, 1.0, 100, endpoint=False)
    plt.semilogy(afs, expected_counts(afs, effective_mutation_rate), "-", color="grey", alpha=0.4)

ax, handles = plot_len_range(1, MAX_LEN, yfunc=lambda calls: len(calls), yscale="log")

plt.xlabel("$\geq$ tumor allele frequency")
plt.ylabel("# of calls")

ax.legend(handles=handles, loc="upper left", bbox_to_anchor=(1.0, 1.0))

plt.tight_layout()

plt.savefig(snakemake.output[0], bbox_inches="tight")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
library(UpSetR)
library(plyr)
library(ggplot2)
library(grid)
library(rlist)

calls <- read.table(snakemake@input[[1]], row.names = NULL, sep = "\t", header = TRUE, check.names = FALSE)
#print(dim(calls))
calls <- calls[calls$max_case_af >= 0.1, ]
#calls <- calls[calls$max_prob_somatic_tumor < 0.07]_
#calls$max_case_af_log10 <- log10(calls$max_case_af)

queries <- list()
for(ds in snakemake@params[["datasets"]]) {
    queries <- list.append(queries, list(query=intersects, params=list(ds)))
}


svg(file = snakemake@output[[1]])
if(!empty(calls)) {
    upset(calls, 
          order.by = "freq", 
          sets = snakemake@params[["datasets"]],
          queries = queries
    )
}
dev.off()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
from itertools import product
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np


MIN_CALLS = 100

colors = common.get_colors(snakemake.config)

props = product(snakemake.params.callers,
                snakemake.params.len_ranges, snakemake.params.fdrs)

calls = []

for _calls, (caller, len_range, fdr) in zip(snakemake.input.varlociraptor_calls, props):
    calls.append({"caller": caller, "len_range": len_range, "fdr": float(fdr), "calls": _calls})

calls = pd.DataFrame(calls)
calls = calls.set_index("caller", drop=False)


def plot_len_range(minlen, maxlen):

    def plot(caller):
        color = colors[caller]
        label = "varlociraptor+{}".format(caller)
        fdrs = []
        alphas = []
        calls_ = calls.loc[caller]
        calls_ = calls_[calls_["len_range"].map(lambda r: r == [minlen, maxlen])]
        calls_ = calls_.sort_values("fdr")
        for e in calls_.itertuples():
            c = pd.read_table(e.calls)
            n = c.shape[0]
            if n < MIN_CALLS:
                continue
            true_fdr = 1.0 - common.precision(c)
            if fdrs and fdrs[-1] == true_fdr:
                continue
            fdrs.append(true_fdr)
            alphas.append(e.fdr)
        plt.plot(alphas, fdrs, ".-", color=color, label=label)


    for caller in calls.index.unique():
        plot(caller)

    plt.plot([0, 1], [0, 1], ":", color="grey")

    sns.despine()
    ax = plt.gca()
    handles, _ = ax.get_legend_handles_labels()
    return ax, handles

common.plot_ranges(
    snakemake.params.len_ranges,
    plot_len_range,
    xlabel="FDR threshold",
    ylabel="true FDR")

plt.savefig(snakemake.output[0], bbox_inches="tight")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
from pysam import VariantFile
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import islice
import pandas as pd


sns.set_style("ticks")


calls = VariantFile(snakemake.input[0])


def freq(sample):
    ref, alt = sample.get("AD")
    if alt == 0:
        return 0
    return alt / (ref + alt)


normal_freqs = []
tumor_freqs = []
vartypes = []

print("collecting calls")
for record in calls:
    if not record.info.get("SHARED"):
        continue
    normal_freqs.append(freq(record.samples.get("normal")))
    tumor_freqs.append(freq(record.samples.get("tumor")))
    vartypes.append(record.info.get("TYPE"))

d = pd.DataFrame({"normal": normal_freqs, "tumor": tumor_freqs, "type": vartypes})

print("subsampling")
# sample d to not get overwhelmed
d = d.sample(10000, random_state=245746)

print("plotting and density estimation")
g = sns.FacetGrid(col="type", data=d)
g.map(sns.kdeplot, "normal", "tumor", shade=True, clip=(0, 1), shade_lowest=False)
g.map(sns.scatterplot, "normal", "tumor", size=1, alpha=0.5, marker=".")

plt.savefig(snakemake.output[0], bbox_inches="tight")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
from itertools import product
from functools import partial
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np
import math
from matplotlib.lines import Line2D

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)


def props(callers):
    return product(callers, snakemake.params.len_ranges)


def plot_len_range(minlen, maxlen, min_precision=0.0):

    truth = common.load_variants(
        snakemake.input.truth, minlen, maxlen, vartype=vartype)

    def plot(calls,
             label,
             color,
             line=True,
             style="-",
             invert=False,
             markersize=4,
             endmarker=False):
        calls = pd.read_table(calls, index_col=0)
        if len(calls) < 10:
            return
        if line:
            thresholds = calls.score.quantile(np.linspace(0.0, 1.0, 50))
            precision = []
            recall = []
            for t in thresholds:
                if invert:
                    c = calls[calls.score >= t]
                else:
                    c = calls[calls.score <= t]
                p = common.precision(c)
                r = common.recall(c, truth)
                print(label, t, c.shape[0], p, r)
                if len(c) < 10:
                    print("skipping threshold: too few calls", c)
                    continue
                precision.append(p)
                recall.append(r)
            if len(precision) <= 2:
                print("skipping curve because we have too few values")
                return
        else:
            precision = [common.precision(calls)]
            recall = [common.recall(calls, truth)]
            style = "."
            print(label, calls.shape[0], precision, recall)

        plt.plot(
            recall,
            precision,
            style,
            color=color,
            label=label,
            markersize=markersize
        )
        if endmarker:
            plt.plot(recall[-1], precision[-1], "s", color=color, markersize=markersize)

    handles = []
    for calls, (caller,
                len_range) in zip(snakemake.input.varlociraptor_calls,
                                  props(snakemake.params.varlociraptor_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        label = "varlociraptor+{}".format(caller)
        plot(calls, label, colors[caller], endmarker=True)
        handles.append(Line2D([0], [0], color=colors[caller], label=label))

    for calls, (caller,
                len_range) in zip(snakemake.input.default_calls,
                                  props(snakemake.params.default_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        color = colors[caller]
        plot(
            calls,
            caller,
            color,
            style=":",
            invert=snakemake.config["caller"][caller].get("invert", False))
        if caller in snakemake.params.adhoc_callers:
            handles.append(Line2D([0], [0], markersize=10, markerfacecolor=color, markeredgecolor=color, color=color, label=caller, marker=".", linestyle=":"))
        else:
            handles.append(Line2D([0], [0], color=color, label=caller, linestyle=":"))

    for calls, (caller, len_range) in zip(snakemake.input.adhoc_calls,
                             props(snakemake.params.adhoc_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        color = colors[caller]
        plot(calls, caller, color, markersize=10, line=False)
        if caller not in snakemake.params.default_callers:
            handles.append(Line2D([0], [0], markersize=10, markerfacecolor=color, markeredgecolor=color, label=caller, marker=".", lw=0))

    sns.despine()
    ax = plt.gca()
    plt.ylim((min_precision, 1.01 if min_precision == 0.0 else 1.001))
    return ax, handles


plot = plot_len_range
fig_height = None
legend_outside = snakemake.params.legend_outside
if snakemake.wildcards.zoom == "zoom":
    plot = partial(plot_len_range, min_precision=0.99 if vartype == "INS" else 0.95)
    fig_height = 3
    legend_outside = True


common.plot_ranges(
    snakemake.params.len_ranges,
    plot,
    xlabel="recall",
    ylabel="precision",
    fig_height=fig_height,
    legend_outside=legend_outside,
)

plt.savefig(snakemake.output[0], bbox_inches="tight")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
from itertools import product
import matplotlib
matplotlib.use("agg")
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import common
import numpy as np
import math

vartype = snakemake.wildcards.vartype
colors = common.get_colors(snakemake.config)

def props(callers):
    return product(callers, snakemake.params.len_ranges)

phred_to_log_factor = -0.23025850929940456
log_to_phred_factor = -4.3429448190325175

def plot_len_range(minlen, maxlen):
    for calls, (caller, len_range) in zip(snakemake.input.varlociraptor_calls, props(snakemake.params.varlociraptor_callers)):
        if len_range[0] != minlen and len_range[1] != maxlen:
            continue
        label = "varlociraptor+{}".format(caller)
        calls = pd.read_table(calls)
        calls["caller"] = label
        if not calls.empty:
            color = colors[caller]
            sns.kdeplot(calls[calls.is_tp].PROB_SOMATIC_TUMOR.map(np.log), color=color, label=label)
            sns.kdeplot(calls[~calls.is_tp].PROB_SOMATIC_TUMOR.map(np.log), color=color, linestyle=":", label="")

    ax = plt.gca()
    fmt_ticks = lambda ticks: ["{:.1g}".format(np.exp(t)) for t in ticks]
    ax.set_xticklabels(fmt_ticks(plt.xticks()[0]))
    ax.legend().remove()
    handles, _ = ax.get_legend_handles_labels()
    sns.despine()

    return ax, handles

common.plot_ranges(
    snakemake.params.len_ranges,
    plot_len_range,
    xlabel=r"$-10 \log_{10}$ Pr(somatic)",
    ylabel="density")

plt.savefig(snakemake.output[0], bbox_inches="tight")
114
115
shell:
    "bcftools index {input}"
125
126
127
128
129
130
run:
    calls = pd.read_table(input[0], header=[0, 1])
    calls = calls["VARIANT"]
    calls = calls[(calls.MATCHING >= 0 if params.type else calls.MATCHING < 0)]
    calls.sort_values("PROB_SOMATIC_TUMOR", ascending=not params.type, inplace=True)
    calls[["CHROM", "POS", "PROB_SOMATIC_TUMOR", "MATCHING"]].to_csv(output[0], sep="\t")
SnakeMake From line 125 of master/Snakefile
154
155
156
157
158
shell:
    "bcftools view {input.bcf} {wildcards.varpos} > {output.vcf}; "
    "samtools view -b {input.bams[0]} {params.region} > {output.tumor}; "
    "samtools view -b {input.bams[1]} {params.region} > {output.normal}; "
    "samtools index {output.tumor}; samtools index {output.normal}"
ShowHide 78 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/varlociraptor/varlociraptor-evaluation
Name: varlociraptor-evaluation
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...