Snakemake-based workflow for detecting structural variants in genomic data


Structural variants (SVs) are an important class of genetic variation implicated in a wide array of genetic diseases. sv-callers is a Snakemake-based workflow that combines several state-of-the-art tools for detecting SVs in whole-genome sequencing (WGS) data. The workflow is easy to use and deploy on any Linux-based machine. In particular, it supports automated software deployment, easy configuration, and the addition of new analysis tools, and it scales from a single computer to different HPC clusters with minimal effort.

Dependencies

  • Python

  • Conda - package/environment management system

  • Snakemake - workflow management system

  • Xenon CLI - command-line interface to compute and storage resources

  • jq - command-line JSON processor (optional)

  • YAtiML - library for YAML type inference and schema validation

The workflow includes the following bioinformatics tools:

  • SV callers: Manta, DELLY, LUMPY, GRIDSS

  • Post-processing: SURVIVOR, Viola

The software dependencies can be found in the conda environment files: [1], [2], [3].

1. Clone this repo.

git clone https://github.com/GooglingTheCancerGenome/sv-callers.git
cd sv-callers

2. Install dependencies.

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (answer 'yes' when prompted)
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba
# create a new environment with dependencies & activate it
mamba env create -n wf -f environment.yaml
conda activate wf
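
To verify the installation before proceeding:

# the 'wf' environment should now be active
conda env list
# Snakemake (and the other dependencies) should be on the PATH
snakemake --version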

3. Configure the workflow.

  • config files:

    • analysis.yaml - analysis-specific settings (e.g., workflow mode, I/O files, SV callers, post-processing, resources used, etc.)

    • samples.csv - list of (paired) samples (see the example after this list)

  • input files:

    • example data in workflow/data directory

    • reference genome in .fasta (incl. index files)

    • excluded regions in .bed (optional)

    • WGS samples in .bam (incl. index files)

  • output files:

    • (filtered) SVs per caller and merged calls in .vcf (incl. index files)
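
For illustration only, a sample sheet for one tumor/normal pair might look like the two lines below; the column names here are hypothetical, so use the example samples.csv in the workflow/data directory as the authoritative template:

PATH,TUMOR,NORMAL
data/bam,tumor1.bam,normal1.bam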

4. Execute the workflow.

cd workflow

Locally

# 'dry' run only checks I/O files
snakemake -np
# 'vanilla' run: if echo_run is set to 1 (default) in analysis.yaml,
# it merely mimics the execution of SV callers by writing (dummy) VCF files;
# real SV calling is performed if echo_run is set to 0
snakemake --use-conda --jobs

Submit jobs to a Slurm or GridEngine cluster

SCH=slurm # or gridengine
snakemake --use-conda --latency-wait 30 --jobs \
--cluster "xenon scheduler $SCH --location local:// submit --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 1 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log" &>smk.log&
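
The submission above runs in the background and writes the Snakemake log to smk.log; per-job output goes to the stdout-%j.log and stderr-%j.log files named in the xenon command. To follow the run:

tail -f smk.log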

Note: One sample or a tumor/normal pair generates 18 SV calling and post-processing jobs in total. See the workflow instance of a single-sample (germline) or paired-sample (somatic) analysis.
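
To inspect the job graph for your own configuration, Snakemake can render the DAG via Graphviz (assuming the dot executable is installed):

# write the workflow DAG as an SVG image
snakemake --dag | dot -Tsvg > dag.svg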

To perform SV calling:

  • edit (default) parameters in analysis.yaml (see the sketch after this list)

    • set echo_run to 0

    • choose between two workflow modes: single-sample (s) or paired-sample (p, the default)

    • select one or more callers using enable_callers (default all)

  • use the Xenon CLI to set:

    • --max-run-time of workflow jobs (in minutes)

    • --temp-space (optional, in MB)

  • adjust compute requirements per SV caller according to the system used:

    • the number of threads,

    • the amount of memory (in MB),

    • the amount of temporary disk space, or tmpspace (the path is taken from the TMPDIR environment variable); it is used for intermediate files by LUMPY and GRIDSS only.
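
Putting the settings above together, an illustrative analysis.yaml excerpt follows; echo_run, mode and enable_callers are named in this document, but the exact nesting and the per-caller resource keys are assumptions, so consult the analysis.yaml shipped with the repo for the authoritative schema:

echo_run: 0        # 0 = real SV calling; 1 = dummy run (default)
mode: p            # 's' = single-sample, 'p' = paired-sample (default)
enable_callers:    # one or more of: manta, delly, lumpy, gridss
  - manta
  - delly
# illustrative per-caller compute requirements (key names assumed)
gridss:
  threads: 8
  memory: 32000    # in MB
  tmpspace: 10000  # in MB; used by LUMPY and GRIDSS only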

Query job accounting information

SCH=slurm # or gridengine
xenon --json scheduler $SCH --location local:// list --identifier [jobID] | jq ...
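
The jq filter is left to the user since the JSON layout depends on the Xenon CLI version; a minimal, schema-agnostic example simply pretty-prints the returned status:

xenon --json scheduler $SCH --location local:// list --identifier [jobID] | jq .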

Code Snippets

DELLY, paired-sample (somatic) mode:
shell:
    """
    set -xe

    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .bcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.bcf"
    TSV="${{OUTDIR}}/sample_pairs.tsv"

    # fetch sample ID from a BAM file
    function get_samp_id() {{
        echo "$(samtools view -H ${{1}} | \
               perl -lne 'print ${{1}} if /\sSM:(\S+)/' | \
               head -n 1)"
    }}

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" > "{output}"
    else
        # use OpenMP for threaded jobs
        export OMP_NUM_THREADS={threads}
        #export OMP_PROC_BIND=true
        #export OMP_PLACES=threads

        # SV calling
        delly call \
            -t "{wildcards.sv_type}" \
            -g "{input.fasta}" \
            -o "${{OUTFILE}}" \
            -q 1 `# min.paired-end mapping quality` \
            -s 9 `# insert size cutoff, DELs only` \
            {params.excl_opt} \
            "{input.tumor_bam}" "{input.normal_bam}"
        # somatic + SV quality filtering
        #   create sample list
        TID=$(get_samp_id "{input.tumor_bam}")
        CID=$(get_samp_id "{input.normal_bam}")
        printf "${{TID}}\ttumor\n${{CID}}\tcontrol\n" > ${{TSV}}
        delly filter \
            -f somatic \
            -t "{wildcards.sv_type}" \
            -p \
            -s "${{TSV}}" \
            -o "{output}" \
            "${{OUTFILE}}"
    fi
    """

DELLY, single-sample (germline) mode:
shell:
    """
    set -xe

    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .bcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.bcf"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" > "{output}"
    else
        # use OpenMP for threaded jobs
        export OMP_NUM_THREADS={threads}

        # SV calling
        delly call \
            -t "{wildcards.sv_type}" \
            -g "{input.fasta}" \
            -o "${{OUTFILE}}" \
            -q 1 `# min.paired-end mapping quality` \
            -s 9 `# insert size cutoff, DELs only` \
            {params.excl_opt} \
            "{input.bam}"
        # SV quality filtering
        bcftools filter \
            -O b `# compressed BCF format` \
            -o "{output}" \
            -i "FILTER == 'PASS'" \
            "${{OUTFILE}}"
        # index BCF file
        bcftools index "{output}"
    fi
    """

DELLY, concatenation of per-SV-type BCF files:
shell:
    """
    set -x

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        cat {input} > "{output}"
    else
        # concatenate rather than merge BCF files
        bcftools concat \
            -a `# allow overlaps` \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            {input}
    fi
    """

GRIDSS, paired-sample (somatic) mode:
shell:
    """
    set -xe

    # if 'tmpspace' set to >0MB use TMPDIR otherwise use OUTDIR
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .vcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.vcf"
    TMP=$([ "{resources.tmp_mb}" -eq "0" ] &&
        echo "${{OUTDIR}}" || echo "${{TMPDIR}}")

    # set JVM max. heap size dynamically (in GB)
    # N.B. don't allocate >31G due to Compressed Oops and JDK-8029679
    MAX_HEAP=$(LC_ALL=C printf "%.f" $(bc <<< "scale=2; \
        {resources.mem_mb} / 1024 * .8")) # max. 80% of requested mem
    MAX_HEAP=$([ "${{MAX_HEAP}}" -gt "31" ] && echo "31g" ||
        echo "${{MAX_HEAP}}g")
    export _JAVA_OPTIONS="-Djava.io.tmpdir=${{TMP}} -Xmx${{MAX_HEAP}}"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" "${{TMP}}" > "{output}"
    else
        # clean-up outdir prior to SV calling
        rm -fr ${{OUTDIR}}/*gridss*
        gridss gridss.CallVariants \
            WORKER_THREADS={threads} \
            REFERENCE_SEQUENCE="{input.fasta}" \
            {params.excl_opt} \
            INPUT="{input.normal_bam}" \
            INPUT="{input.tumor_bam}" \
            OUTPUT="${{OUTFILE}}" \
            ASSEMBLY="${{OUTDIR}}/gridss_assembly.bam" \
            WORKING_DIR="${{TMP}}" \
            TMP_DIR="${{TMP}}/gridss.${{RANDOM}}"
        # somatic + SV quality filtering
        #   'normal' sample assumes index 0
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            -i "FORMAT/QUAL[0] == 0 && FILTER == '.'" \
            "${{OUTFILE}}"
    fi
    """

GRIDSS, single-sample (germline) mode:
shell:
    """
    set -xe

    # if 'tmpspace' set to >0MB use TMPDIR otherwise use OUTDIR
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .vcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.vcf"
    TMP=$([ "{resources.tmp_mb}" -eq "0" ] &&
        echo "${{OUTDIR}}" || echo "${{TMPDIR}}")

    # set JVM max. heap size dynamically (in GB)
    # N.B. don't allocate >31G due to Compressed Oops and JDK-8029679
    MAX_HEAP=$(LC_ALL=C printf "%.f" $(bc <<< "scale=2; \
        {resources.mem_mb} / 1024 * .8")) # max. 80% of requested mem
    MAX_HEAP=$([ "${{MAX_HEAP}}" -gt "31" ] && echo "31g" ||
        echo "${{MAX_HEAP}}g")
    export _JAVA_OPTIONS="-Djava.io.tmpdir=${{TMP}} -Xmx${{MAX_HEAP}}"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" "${{TMP}}" > "{output}"
    else
        # clean-up outdir prior to SV calling
        rm -fr ${{OUTDIR}}/*gridss* &&
        gridss gridss.CallVariants \
            WORKER_THREADS={threads} \
            REFERENCE_SEQUENCE="{input.fasta}" \
            {params.excl_opt} \
            INPUT="{input.bam}" \
            OUTPUT="${{OUTFILE}}" \
            ASSEMBLY="${{OUTDIR}}/gridss_assembly.bam" \
            WORKING_DIR="${{TMP}}" \
            TMP_DIR="${{TMP}}/gridss.${{RANDOM}}"
        # SV quality filtering
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            -i "FILTER == '.'" \
            "${{OUTFILE}}"
    fi
    """

LUMPY, paired-sample (somatic) mode:
shell:
    """
    set -xe

    # if 'tmpspace' set to >0MB use TMPDIR otherwise use OUTDIR
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .vcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.vcf"
    TMP=$([ "{resources.tmp_mb}" -eq "0" ] &&
        echo "${{OUTDIR}}" || echo "${{TMPDIR}}")

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" "${{TMP}}" > "{output}"
    else
        lumpyexpress \
            -B "{input.tumor_bam}","{input.normal_bam}" \
            {params.excl_opt} \
            -o "${{OUTFILE}}" \
            -m 4 `# min. sample weight` \
            -r 0 `# trim threshold` \
            -k `# keep tmp files` \
            -T "${{TMP}}/lumpy.${{RANDOM}}"
        # somatic + SV quality filtering
        #   'normal' sample assumes index 1
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            -i "FORMAT/SU[1] == 0 && FILTER == '.'" \
            "${{OUTFILE}}"
    fi
    """

LUMPY, single-sample (germline) mode:
shell:
    """
    set -xe

    # if 'tmpspace' set to >0MB use TMPDIR otherwise use OUTDIR
    OUTDIR="$(dirname "{output}")"
    PREFIX="$(basename "{output}" .vcf)"
    OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.vcf"
    TMP=$([ "{resources.tmp_mb}" -eq "0" ] &&
        echo "${{OUTDIR}}" || echo "${{TMPDIR}}")

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" "${{TMP}}" > "{output}"
    else
        lumpyexpress \
            -B "{input.bam}" \
            {params.excl_opt} \
            -o "${{OUTFILE}}" \
            -m 4 `# min. sample weight` \
            -r 0 `# trim threshold` \
            -k `# keep tmp files` \
            -T "${{TMP}}/lumpy.${{RANDOM}}"
        # SV quality filtering
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "{output}" \
            -i "FILTER == '.'" \
            "${{OUTFILE}}"
    fi
    """

Manta, paired-sample (somatic) mode:
shell:
    """
    set -xe

    OUTDIR="$(dirname "{output}")"
    OUTFILE="$(basename "{output}")"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" > "{output}"
    else
        configManta.py \
            --runDir "${{OUTDIR}}" \
            --reference "{input.fasta}" \
            --tumorBam "{input.tumor_bam}" \
            --normalBam "{input.normal_bam}"
        cd "${{OUTDIR}}"
        ./runWorkflow.py \
            --quiet \
            -m local \
            -j {threads}
        # SV quality filtering
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "${{OUTFILE}}" \
            -i "FILTER == 'PASS'" \
            "{params.outfile}"
    fi
    """

Manta, single-sample (germline) mode:
shell:
    """
    set -xe

    OUTDIR="$(dirname "{output}")"
    OUTFILE="$(basename "{output}")"

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        echo "{input}" > "{output}"
    else
        configManta.py \
            --runDir "${{OUTDIR}}" \
            --reference "{input.fasta}" \
            {params.bam_opt} "{input.bam}"
        cd "${{OUTDIR}}"
        ./runWorkflow.py \
            --quiet \
            -m local \
            -j {threads}
        # SV quality filtering
        bcftools filter \
            -O v `# uncompressed VCF format` \
            -o "${{OUTFILE}}" \
            -i "FILTER == 'PASS'" \
            "{params.outfile}"
    fi
    """

SURVIVOR, SV filtering (applied when an exclusion list is configured):
shell:
    """
    set -xe

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        cat "{input}" > "{output}"
    else
        if [ "{params.excl}" -eq "1" ]; then
            SURVIVOR filter "{input}" {params.args} "{output}"
        else
            ln -sr "{input}" "{output}"
        fi
    fi
    """

SURVIVOR, merging of per-caller VCF files:
shell:
    """
    set -xe

    # create a list of VCF files
    for f in $(echo "{input}")
    do
        echo "$f" >> "{output[0]}"
    done

    # run dummy or real job
    if [ "{config.echo_run}" -eq "1" ]; then
        cat "{output[0]}" > "{output[1]}"
    else
        SURVIVOR merge "{output[0]}" {params.args} "{output[1]}"
    fi
    """

Viola, VCF post-processing (invoked as a rule script):
script:
    "../scripts/viola_vcf.py"

viola_vcf.py:
import os
import shutil
import viola

vcf_in = str(snakemake.input)
vcf_org = vcf_in + '.org'
vcf_out = str(snakemake.output)
caller = str(snakemake.wildcards.prefix)

if caller == 'gridss':
    os.rename(vcf_in, vcf_org)
    with open(vcf_in, 'w') as new:
        with open(vcf_org, 'r') as org:
            for line in org:
                # FIX INFO field: change PARID to MATEID
                new.write(line.replace('PARID', 'MATEID'))
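
# convert breakend (BND) records to breakpoint notation and write the
# harmonized VCF; if Viola cannot parse the input, pass it through unchanged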
try:
    sv = viola.read_vcf(vcf_in, variant_caller=caller).breakend2breakpoint()
    sv.to_vcf(vcf_out)
except Exception:
    shutil.copyfile(vcf_in, vcf_out)

Name: sv-callers
Version: v1.2.2
URL: https://research-software.nl/software/sv-callers
License: Apache License 2.0