V-pipe is a pipeline designed for analysing NGS data of short viral genomes

public public 7mo ago Version: v2.99.3 0 bookmarks

Logo

V-pipe is a workflow designed for the analysis of next generation sequencing (NGS) data from viral pathogens. It produces a number of results in a curated format (e.g., consensus sequences, SNV calls, local/global haplotypes). V-pipe is written using the Snakemake workflow management system.

Usage

Different ways of initializing V-pipe are presented below. We strongly encourage you to deploy it using the quick install script , as this is our preferred method.

To configure V-pipe refer to the documentation present in config/README.md .

V-pipe expects the input samples to be organized in a two-level directory hierarchy, and the sequencing reads must be provided in a sub-folder named raw_data . Further details can be found on the website . Check the utils subdirectory for mass-importers tools that can assist you in generating this hierarchy.

We provide virus-specific base configuration files which contain handy defaults for, e.g., HIV and SARS-CoV-2. Set the virus in the general section of the configuration file:

general:
 virus_base_config: hiv

Also see snakemake's documentation to learn more about the command-line options available when executing the workflow.

Tutorials introducing usage of V-pipe are available in the docs/ subdirectory.

Using quick install script

To deploy V-pipe, use the installation script with the following parameters:

curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
./quick_install.sh -w work

This script will download and install miniconda, checkout the V-pipe git repository (use -b to specify which branch/tag) and setup a work directory (specified with -w ) with an executable script that will execute the workflow:

cd work
# edit config.yaml and provide samples/ directory
./vpipe --jobs 4 --printshellcmds --dry-run

Using Docker

Note: the docker image is only setup with components to run the workflow for HIV and SARS-CoV-2 virus base configurations. Using V-pipe with other viruses or configurations might require internet connectivity for additional software components.

Create config.yaml or vpipe.config and then populate the samples/ directory. For example, the following config file could be used:

general:
 virus_base_config: hiv
output:
 snv: true
 local: true
 global: false
 visualization: true
 QA: true

Then execute:

docker run --rm -it -v $PWD:/work ghcr.io/cbg-ethz/v-pipe:master --jobs 4 --printshellcmds --dry-run

Using Snakedeploy

First install mamba , then create and activate an environment with Snakemake and Snakedeploy:

mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy
conda activate snakemake

Snakemake's official workflow installer Snakedeploy can now be used:

snakedeploy deploy-workflow https://github.com/cbg-ethz/V-pipe --tag master .
# edit config/config.yaml and provide samples/ directory
snakemake --use-conda --jobs 4 --printshellcmds --dry-run

Dependencies

  • Conda

    Conda is a cross-platform package management system and an environment manager application. Snakemake uses mamba as a package manager.

  • Snakemake

    Snakemake is the central workflow and dependency manager of V-pipe. It determines the order in which individual tools are invoked and checks that programs do not exit unexpectedly.

  • VICUNA

    VICUNA is a de novo assembly software designed for populations with high mutation rates. It is used to build an initial reference for mapping reads with ngshmmalign aligner when a references/cohort_consensus.fasta file is not provided. Further details can be found in the wiki pages.

Computational tools

Other dependencies are managed by using isolated conda environments per rule, and below we list some of the computational tools integrated in V-pipe:

  • FastQC

    FastQC gives an overview of the raw sequencing data. Flowcells that have been overloaded or otherwise fail during sequencing can easily be determined with FastQC.

  • PRINSEQ

    Trimming and clipping of reads is performed by PRINSEQ. It is currently the most versatile raw read processor with many customization options.

  • ngshmmalign

    We perform the alignment of the curated NGS data using our custom ngshmmalign that takes structural variants into account. It produces multiple consensus sequences that include either majority bases or ambiguous bases.

  • bwa

    In order to detect specific cross-contaminations with other probes, the Burrows-Wheeler aligner is used. It quickly yields estimates for foreign genomic material in an experiment. Additionally, It can be used as an alternative aligner to ngshmmalign.

  • MAFFT

    To standardise multiple samples to the same reference genome (say HXB2 for HIV-1), the multiple sequence aligner MAFFT is employed. The multiple sequence alignment helps in determining regions of low conservation and thus makes standardisation of alignments more robust.

  • Samtools and bcftools

    The Swiss Army knife of alignment postprocessing and diagnostics. bcftools is also used to generate consensus sequence with indels.

  • SmallGenomeUtilities

    We perform genomic liftovers to standardised reference genomes using our in-house developed python library of utilities for rewriting alignments.

  • ShoRAH

    ShoRAh performs SNV calling and local haplotype reconstruction by using bayesian clustering.

  • LoFreq

    LoFreq (version 2) is SNVs and indels caller from next-generation sequencing data, and can be used as an alternative engine for SNV calling.

  • SAVAGE and Haploclique

    We use HaploClique or SAVAGE to perform global haplotype reconstruction for heterogeneous viral populations by using an overlap graph.

Citation

If you use this software in your research, please cite:

Posada-Céspedes S., Seifert D., Topolsky I., Jablonski K.P., Metzner K.J., and Beerenwinkel N. 2021. "V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput sequencing data." Bioinformatics , January. doi: 10.1093/bioinformatics/btab015 .

Contributions

* software maintainer ; ** group leader

Contact

We encourage users to use the issue tracker . For further enquiries, you can also contact the V-pipe Dev Team [email protected] .

Code Snippets

 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    source {params.FUNCTIONS}

    ERRFILE=$(basename {log.errfile})
    OUTFILE=$(basename {log.outfile})

    # 1. copy initial reference for bwa
    rm -rf {params.WORK_DIR}/
    mkdir -p {params.WORK_DIR}/
    cp {input.global_ref} {params.WORK_DIR}/consensus.fasta
    cd {params.WORK_DIR}

    # 2. create bwa index
    {params.BWA} index consensus.fasta 2> >(tee $ERRFILE >&2)

    # 3. create initial alignment
    if [[ {params.PAIRED_BOOL} == "true" ]]; then
        {params.BWA} mem -t {threads} consensus.fasta ../preprocessed_data/R{{1,2}}.fastq > first_aln.sam 2> >(tee -a $ERRFILE >&2)
    else
        {params.BWA} mem -t {threads} consensus.fasta ../preprocessed_data/R1.fastq > first_aln.sam 2> >(tee -a $ERRFILE >&2)
    fi
    rm consensus.fasta.*

    # 4. remove unmapped reads
    {params.SAMTOOLS} view -b -F 4 first_aln.sam > mapped.bam 2> >(tee -a $ERRFILE >&2)
    rm first_aln.sam

    # 5. extract reads
    mkdir -p cleaned
    SamToFastq {params.PICARD} I=mapped.bam FASTQ=cleaned/R1.fastq {params.PAIRED} VALIDATION_STRINGENCY=SILENT 2> >(tee -a $ERRFILE >&2)
    rm mapped.bam

    # 6. create config file
    # NOTE: Tabs are required below
    if [[ {params.PAIRED_BOOL} == "true" ]]; then
        cat > vicuna_config.txt <<- _EOF_
            minMSize    9
            maxOverhangSize    2
            Divergence    8
            max_read_overhang    2
            max_contig_overhang    10
            pFqDir    cleaned/
            batchSize    100000
            LibSizeLowerBound    100
            LibSizeUpperBound    800
            min_output_contig_len    1000
            outputDIR    ./
        _EOF_
    else
        cat > vicuna_config.txt <<- _EOF_
            minMSize    9
            maxOverhangSize    2
            Divergence    8
            max_read_overhang    2
            max_contig_overhang    10
            npFqDir    cleaned/
            batchSize    100000
            min_output_contig_len    1000
            outputDIR    ./
        _EOF_
    fi

    # 7. VICUNA
    OMP_NUM_THREADS={threads} {params.VICUNA} vicuna_config.txt > $OUTFILE 2> >(tee -a $ERRFILE >&2)
    rm vicuna_config.txt
    rm -r cleaned/

    # 8. fix broken header
    sed -e 's:>dg-\([[:digit:]]\+\)\s.*:>dg-\1:g' contig.fasta > contig_clean.fasta

    # 9. InDelFixer + ConsensusFixer to polish up consensus
    for i in {{1..3}}
    do
            mv consensus.fasta old_consensus.fasta
            indelFixer {params.INDELFIXER} -i contig_clean.fasta -g old_consensus.fasta >> $OUTFILE 2> >(tee -a $ERRFILE >&2)
            sam2bam {params.SAMTOOLS} reads.sam >> $OUTFILE 2> >(tee $ERRFILE >&2)
            consensusFixer {params.CONSENSUSFIXER} -i reads.bam -r old_consensus.fasta -mcc 1 -mic 1 -d -pluralityN 0.01 >> $OUTFILE 2> >(tee $ERRFILE >&2)
    done

    sed -i -e "s/>.*/>${{CONSENSUS_NAME}}/" consensus.fasta
    echo "" >> consensus.fasta

    # 10. finally, move into place
    mkdir -p ../references
    mv {{,../references/vicuna_}}consensus.fasta
    """
158
159
160
161
162
163
164
165
shell:
    """
    cat {input} > initial_ALL.fasta
    {params.MAFFT} --nuc --preservecase --maxiterate 1000 --localpair --thread {threads} initial_ALL.fasta > references/initial_aln.fasta 2> >(tee {log.errfile} >&2)
    rm initial_ALL.fasta

    {params.REMOVE_GAPS} references/initial_aln.fasta -o {output} -p 0.5 > {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
SnakeMake From line 158 of rules/align.smk
181
182
183
184
185
186
187
188
189
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    mkdir -p {wildcards.dataset}/references/
    {params.EXTRACT_SEQ} {input} -o {output} -s "${{CONSENSUS_NAME}}"
    """
SnakeMake From line 181 of rules/align.smk
201
202
203
204
205
206
207
208
209
210
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    mkdir -p {wildcards.dataset}/references/
    cp {input} {output}
    sed -i -e "s/>.*/>${{CONSENSUS_NAME}}/" {output}
    """
SnakeMake From line 201 of rules/align.smk
222
223
224
225
226
227
228
229
230
231
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    mkdir -p {wildcards.dataset}/references/
    cp {input} {output}
    sed -i -e "s/>.*/>${{CONSENSUS_NAME}}/" {output}
    """
SnakeMake From line 222 of rules/align.smk
281
282
283
284
shell:
    """
    {params.GUNZIP} -c {input} > {output} 2> >(tee {log.errfile} >&2)
    """
SnakeMake From line 281 of rules/align.smk
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    # 1. clean previous run
    rm -rf   {wildcards.dataset}/alignments
    rm -f    {wildcards.dataset}/references/ref_ambig.fasta
    rm -f    {wildcards.dataset}/references/ref_majority.fasta
    mkdir -p {wildcards.dataset}/alignments
    mkdir -p {wildcards.dataset}/references

    # 2. perform alignment # -l = leave temps
    {params.NGSHMMALIGN} -v {params.EXTRA} -R {input.initial_ref} -o {output.good_aln} -w {output.reject_aln} -t {threads} -N "${{CONSENSUS_NAME}}" {params.LEAVE_TEMP} {input.FASTQ} > {log.outfile} 2> >(tee {log.errfile} >&2)

    # 3. move references into place
    mv {wildcards.dataset}/{{alignments,references}}/ref_ambig.fasta
    mv {wildcards.dataset}/{{alignments,references}}/ref_majority.fasta
    """
SnakeMake From line 314 of rules/align.smk
365
366
367
368
369
370
shell:
    """
    cat {input} > ALL_{wildcards.kind}.fasta
    {params.MAFFT} --nuc --preservecase --maxiterate 1000 --localpair --thread {threads} ALL_{wildcards.kind}.fasta > {output} 2> >(tee {log.errfile} >&2)
    rm ALL_{wildcards.kind}.fasta
    """
SnakeMake From line 365 of rules/align.smk
407
408
409
410
shell:
    """
    {params.CONVERT_REFERENCE} -t {params.REF_NAME} -m {input.REF_ambig} -i {input.BAM} -o {output} > {log.outfile} 2> >(tee {log.errfile} >&2)
    """
SnakeMake From line 407 of rules/align.smk
441
442
443
444
445
446
447
shell:
    """
    echo "Writing BAM file"
    rm -f '{params.sort_tmp}'.[0-9]*.bam
    {params.SAMTOOLS} sort -T "{params.sort_tmp}" -o "{output.BAM}" "{input}" > {log.outfile} 2> >(tee {log.errfile} >&2)
    {params.SAMTOOLS} index "{output.BAM}" >> {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
SnakeMake From line 441 of rules/align.smk
475
476
477
478
shell:
    """
    {params.BWA} index {input} 2> >(tee {log.errfile} >&2)
    """
SnakeMake From line 475 of rules/align.smk
521
522
523
524
525
526
shell:
    """
    {params.BWA} mem -t {threads} {params.EXTRA} -o "{output.TMP_SAM}" "{input.REF}" {input.FASTQ} 2> >(tee {log.errfile} >&2)
    # Filter alignments: (1) remove unmapped reads (single-end) or keep only reads mapped in proper pairs (paired-end), (2) remove supplementary aligments
    {params.SAMTOOLS} view -h {params.FILTER} -F 2048 -o "{output.REF}" "{output.TMP_SAM}" 2> >(tee -a {log.errfile} >&2)
    """
SnakeMake From line 521 of rules/align.smk
551
552
553
554
shell:
    """
    {params.BOWTIE} {input} {input} 2> >(tee {log.errfile} >&2)
    """
SnakeMake From line 551 of rules/align.smk
588
589
590
591
592
593
shell:
    """
    {params.BOWTIE} -x {input.REF} -1 {input.R1} -2 {input.R2} {params.PHRED} {params.PRESET} -X {params.MAXINS} {params.EXTRA} -p {threads} -S {output.TMP_SAM} 2> >(tee {log.errfile} >&2)
    # Filter alignments: (1) keep only reads mapped in proper pairs, and (2) remove supplementary aligments
    {params.SAMTOOLS} view -h -f 2 -F 2048 -o "{output.REF}" "{output.TMP_SAM}" 2> >(tee -a {log.errfile} >&2)
    """
SnakeMake From line 588 of rules/align.smk
625
626
627
628
629
630
shell:
    """
    {params.BOWTIE} -x {input.REF} -U {input.R1} {params.PHRED} {params.PRESET} {params.EXTRA} -p {threads} -S {output.TMP_SAM} 2> >(tee {log.errfile} >&2)
    # Filter alignments: (1) remove unmapped reads, and (2) remove supplementary aligments
    {params.SAMTOOLS} view -h -F 4 -F 2048 -o "{output.REF}" "{output.TMP_SAM} 2> >(tee -a {log.errfile} >&2)
    """
SnakeMake From line 625 of rules/align.smk
11
12
13
14
shell:
    """
    rm -rf {params.DIR}/*/*/extracted_data
    """
20
21
22
23
shell:
    """
    rm -rf {params.DIR}/*/*/preprocessed_data
    """
29
30
31
32
33
34
35
36
37
shell:
    """
    rm -rf {params.DIR}/*/*/initial_consensus
    rm -rf {params.DIR}/*/*/references/vicuna_consensus.fasta
    rm -rf {params.DIR}/*/*/references/initial_consensus.fasta
    rm -rf references/initial_aln.fasta
    rm -rf references/initial_aln_gap_removed.fasta
    rm -rf references/MAFFT_initial_aln.*
    """
41
42
43
44
45
shell:
    """
    rm -rf references/ALL_aln_*.fasta
    rm -rf references/MAFFT_*_cohort.*
    """
51
52
53
54
55
56
57
58
shell:
    """
    rm -rf {params.DIR}/*/*/alignments
    rm -rf {params.DIR}/*/*/QA_alignments
    rm -rf {params.DIR}/*/*/references/ref_ambig.fasta
    rm -rf {params.DIR}/*/*/references/ref_majority.fasta
    rm -rf {params.DIR}/*/*/references/initial_consensus.fasta
    """
66
67
68
69
70
71
72
shell:
    """
    rm -f {input}
    rm -rf {params.DIR}/*/*/alignments
    rm -rf {params.DIR}/*/*/references/ref_ambig*.fasta
    rm -rf {params.DIR}/*/*/references/ref_majority*.fasta
    """
80
81
82
83
84
85
86
shell:
    """
    rm -f {input}
    rm -rf {params.DIR}/*/*/alignments
    rm -rf {params.DIR}/*/*/references/ref_ambig*.fasta
    rm -rf {params.DIR}/*/*/references/ref_majority*.fasta
    """
92
93
94
95
shell:
    """
    rm -rf {params.DIR}/*/*/variants/SNVs
    """
101
102
103
104
105
shell:
    """
    rm -rf {params.DIR}/*/*/variants/global/contigs_stage_?.fasta
    rm -rf {params.DIR}/*/*/variants/global/stage_?
    """
SnakeMake From line 101 of rules/clean.smk
111
112
113
114
shell:
    """
    rm {params.DIR}/*/*/variants/global/quasispecies.*
    """
SnakeMake From line 111 of rules/clean.smk
120
121
122
123
shell:
    """
    rm -rf {params.DIR}/*/*/variants/global/predicthaplo/
    """
SnakeMake From line 120 of rules/clean.smk
129
130
131
132
shell:
    """
    rm -rf {params.DIR}/*/*/visualization
    """
SnakeMake From line 129 of rules/clean.smk
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
shell:
    """
    {params.bcftools} mpileup \
        --threads {threads} \
        -Ou \
        -f {input.fname_ref} \
        --max-depth {params.max_coverage} \
        --max-idepth {params.max_coverage} \
        --annotate FORMAT/AD,FORMAT/DP,INFO/AD \
        {input.fname_bam} \
    | {params.bcftools} call \
        --threads {threads} \
        -Ou \
        -mv \
        --keep-alts \
    | {params.bcftools} norm \
        --threads {threads} \
        -Ou \
        -f {input.fname_ref} \
    | {params.bcftools} filter \
        --threads {threads} \
        -e 'TYPE="INDEL" & INFO/AD[1]<INFO/AD[0]' \
        -Ob \
        --output {output.fname_temp_bcf} \
        2> >(tee {log.errfile} >&2)
    #bcftools csq -f {input.fname_ref} -g wheretogetthis.gff3.gz in.vcf -Ob -o out.bcf

    {params.gunzip} -c {input.fname_cov} | tail -n +2 \
    | awk -v base={params.tsvbased} \
        '$3 < {params.mask_coverage_threshold} {{printf "%s\\t%d\\t%d\\n", $1, $2 - base, $2 - base + 1}}' \
    > {output.fname_mask_lowcoverage} \
        2> >(tee -a {log.errfile} >&2)

    # preparations
    {params.enhance_bcf} \
       {output.fname_temp_bcf} \
       {output.fname_bcf} \
       {params.ambiguous_base_coverage_threshold} \
       2> >(tee -a {log.errfile} >&2)

    {params.bcftools} index {output.fname_bcf} 2> >(tee -a {log.errfile} >&2)

    common_consensus_params="--fasta-ref {input.fname_ref} --mark-del - --mask {output.fname_mask_lowcoverage} --mask-with n"

    # majority bases
    {params.bcftools} consensus \
        --output {output.fname_fasta} \
        --chain {output.fname_chain} \
        $common_consensus_params \
        -H A \
        -i "INFO/AD[0]<INFO/AD[*]" \
        {output.fname_bcf} \
        2> >(tee -a {log.errfile} >&2)

    # ambiguous bases
    {params.bcftools} consensus \
        --output {output.fname_fasta_ambig} \
        --chain {output.fname_chain_ambig} \
        $common_consensus_params \
        -H I --iupac-codes \
        {output.fname_bcf} \
        2> >(tee -a {log.errfile} >&2)
    """
140
141
142
143
144
145
146
147
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    {params.EXTRACT_CONSENSUS} -i {input.BAM} -f {input.REF} -c {params.MIN_COVERAGE} -n {params.N_COVERAGE} -q {params.QUAL_THRD} -a {params.MIN_FREQ} -N "${{CONSENSUS_NAME}}" -o {params.OUTDIR} > {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
172
173
174
175
176
177
178
179
180
shell:
    """
    if tail -n +2 {input.REF_majority_dels} | grep -qE '[^n]'; then
        {params.MATCHER} -asequence {input.REF} -bsequence {input.REF_majority_dels} -outfile {output.REF_matcher} 2> >(tee {log.errfile} >&2)
    else
        touch {output.REF_matcher}
        echo "pure 'nnnn...' consensus, no possible alignement" | tee {log.outfile}
    fi
    """
210
211
212
213
shell:
    """
    {params.FRAMESHIFT_DEL_CHECKS} -i {input.BAM} -c {input.REF_majority_dels} -f {input.REF_NAME} -g {input.GENES_GFF} --english=true -o {output.FRAMESHIFT_DEL_CHECK_TSV} 2> >(tee {log.errfile} >&2)
    """
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
shell:
    """
    # 1. cleanup old run
    rm -f {output.SAM} {output.MSA}

    # 2. concatenate references
    mkdir -p {wildcards.dataset}/QA_alignments
    cat {input.patient_ref} {input.virusmix_ref} > {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta

    # 3. indexing
    {params.BWA} index {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta 2> >(tee {log.errfile} >&2)

    # 4. align
    {params.BWA} mem -t {threads} {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta {input.FASTQ} > {output.SAM} 2> >(tee -a {log.errfile} >&2)

    # 5. MSA
    {params.MAFFT} --nuc --preservecase --maxiterate 1000 --localpair --thread {threads} {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta > {output.MSA} 2> >(tee -a {log.errfile} >&2)

    # 6. cleanup BWA indices
    rm -f {wildcards.dataset}/QA_alignments/bwa_refs_{wildcards.kind}.fasta.*
    """
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
shell:
    """
    CONSENSUS_NAME={wildcards.dataset}
    CONSENSUS_NAME="${{CONSENSUS_NAME#*/}}"
    CONSENSUS_NAME="${{CONSENSUS_NAME//\//-}}"

    # 1. clean previous run
    rm -f {output}
    mkdir -p {wildcards.dataset}/QA_alignments

    # 2. collect coverage stats
    # we only collect statistics in the loop regions
    # of HIV-1 in order
    {params.COV_STATS} -t {params.TARGET} -i {input.BAM} -o {output} -m {input.MSA} --select "${{CONSENSUS_NAME}}" > {log.outfile} 2> >(tee {log.errfile} >&2)
    """
35
36
37
38
39
40
41
42
43
44
45
46
47
shell:
    """
    echo "Keep reject  -----------------------------------------------------"
    echo

    {params.SAMTOOLS} bam2fq -@ {threads} \
                             -F 2 \
                             -1 {output.reject_1} \
                             -2 {output.reject_2} \
                             {input.reject_aln} \
                             2> >(tee {log.errfile} >&2)
    echo
    """
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
shell:
    """
    echo "Filter out virus' reads  -----------------------------------------"
    echo

    {params.BWA} mem -t {threads} \
                     -o {output.tmp_aln} \
                     {input.global_ref} {input.fastq} \
                     > {log.outfile} 2> >(tee {log.errfile} >&2)

    echo
    echo "Keep reject  -----------------------------------------------------"
    echo

    {params.SAMTOOLS} bam2fq -@ {threads} \
                             -F 2 \
                             -1 {output.reject_1} \
                             -2 {output.reject_2} \
                             {output.tmp_aln} \
                             2> >(tee -a {log.errfile} >&2)
    echo
    """
108
109
110
111
shell:
    """
    curl --output "{output.host_ref}" "{params.host_ref_url}"
    """
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
shell:
    """
    # using zcat FILENAME.gz causes issues on Mac, see
    # https://serverfault.com/questions/570024/
    # redirection fixes this:
    unpack_rawreads() {{
        for fq in "${{@}}"; do
            zcat -f < "${{fq}}"
        done
    }}

    echo
    echo "Count aligned reads ---------------------------------------------"
    echo

    count=$({params.SAMTOOLS} view -@ {threads} -c -f {params.F} -F 2304 {input.host_aln} | tee {output.filter_count} 2> >(tee {log.errfile} >&2) )

    if (( count > 0 )); then
        echo
        echo "-----------------------------------------------------------------"
        echo "Needs special care: ${{count}} potential human reads found"
        echo "-----------------------------------------------------------------"
        echo
        echo "Removing identified host reads from raw reads -------------------"
        echo

        # get list
        {params.SAMTOOLS} view -@ {threads} \
                               -f {params.F} \
                               {input.host_aln} \
                               | cut -f 1 > {output.filter_list} 2> >(tee -a {log.errfile} >&2)

        unpack_rawreads {input.R1:q} \
               | {params.remove_reads_script} {output.filter_list} \
               | gzip \
               > {output.filtered_1} 2> >(tee -a {log.errfile} >&2) &

        unpack_rawreads {input.R2:q} \
               | {params.remove_reads_script} {output.filter_list} \
               | gzip \
               > {output.filtered_2} 2> >(tee -a {log.errfile} >&2) &

        wait

        if (( {params.keep_host} )); then
            # keep the rejects for further analysis

            echo
            echo "Keeping host-aligned virus' rejects ------------------------------"
            echo


            # (we compress reference-less, because the reference size is larger
            # than the contaminant reads)
            rm -f '{params.sort_tmp}'.[0-9]*.bam
            FMT=cram,no_ref,use_bzip2,use_lzma,level=9,seqs_per_slice=1000000
            {params.SAMTOOLS} view -@ {threads} \
                                  -h -f 2 \
                                  {input.host_aln} \
                | {params.SAMTOOLS} sort -@ {threads} \
                                -T {params.sort_tmp} \
                                --output-fmt ${{FMT}} \
                                -o {params.host_aln_cram} \
                                2> >(tee -a {log.errfile} >&2)

            echo
            echo "Compressing host-depleted raw reads ------------------------------"
            echo

            {params.SAMTOOLS} index -@ {threads} {params.host_aln_cram} 2> >(tee -a {log.errfile} >&2)
        fi
    else
        echo
        echo "No potential human reads found -----------------------------------"
        echo "Copy raw reads file"
        echo
        unpack_rawreads {input.R1} | gzip > {output.filtered_1} 2> >(tee -a {log.errfile} >&2) &
        unpack_rawreads {input.R2} | gzip > {output.filtered_2} 2> >(tee -a {log.errfile} >&2) &
        wait
        touch {output.filter_list}
    fi
    echo
    """
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
shell:
    """
    echo "Compress filtered sequences --------------------------------------"
    echo

    {params.BWA} mem -t {threads} \
                     -C \
                     -o {output.cram_sam} \
                     {input.global_ref} {input.filtered_1} {input.filtered_2} \
                     > {log.outfile} 2> >(tee {log.errfile} >&2)

    # HACK handle incompatibilities between:
    #  - Illumina's 'bcl2fastq', which write arbitrary strings
    #  - 'bwa mem' which keep comments in the SAM file verbatim as in the FASTQ file
    #  - 'samtools' which expects comment to be properly marked as 'BC:Z:'
    #    as per SAM format specs
    REGEXP=\'s{{(?<=\\t)([[:digit:]]:[[:upper:]]:[[:digit:]]:([ATCGN]+(\+[ATCGN]+)?|[[:digit:]]+))$}}{{BC:Z:\\1}}\'
    FMT=cram,embed_ref,use_bzip2,use_lzma,level=9,seqs_per_slice=1000000

    rm -f '{params.sort_tmp}'.[0-9]*.bam
    perl -p -e ${{REGEXP}} {output.cram_sam} \
          | {params.SAMTOOLS} sort -@ {threads} \
                                   -T {params.sort_tmp} \
                                   -M \
                                   --reference {input.global_ref} \
                                   --output-fmt ${{FMT}} \
                                   -o {output.final_cram} \
                                   2> >(tee -a {log.errfile} >&2)

    {params.checksum_type}sum {output.final_cram} > {output.checksum} 2> >(tee -a {log.errfile} >&2)

    echo
    echo DONE -------------------------------------------------------------
    echo
    """
22
23
script:
    "../scripts/compute_diversity_measures.py"
53
54
script:
    "../scripts/aggregate_diversity.py"
38
39
40
41
shell:
    """
    {params.HAPLOCLIQUE} {params.EXTRA_PARAMETERS} {params.RELAX} {params.NO_SINGLETONS} {params.NO_PROB0} --limit_clique_size={params.CLIQUE_SIZE_LIMIT} --max_cliques={params.MAX_NUM_CLIQUES} --log={log.outfile} --bam {input} {params.OUTPREFIX} 2> >(tee {log.errfile} >&2)
    """
70
71
72
73
shell:
    """
    {params.COMPUTE_MDS} -q {params.INPREFIX} -s {params.REGION_START} -e {params.REGION_END} {params.USE_MSA} {params.MSA} -p {output.PDF} -o {params.TSV} > {log.output} 2> >(tee {log.errfile} >&2)
    """
103
104
105
106
107
108
109
110
111
112
113
114
115
shell:
    """
    # Convert BAM to FASTQ without re-reversing reads - SAVAGE expect all reads in the same direction
    source {params.FUNCTIONS}
    SamToFastq {params.PICARD} I={input} FASTQ={output.R1} SECOND_END_FASTQ={output.R2} RC=false 2> >(tee {log.errfile} >&2)
    # Remove /1 and /2 from the read names
    sed -i -e "s:/1$::" {output.R1}
    sed -i -e "s:/2$::" {output.R2}

    R1=${{PWD}}/{output.R1}
    R2=${{PWD}}/{output.R2}
    {params.SAVAGE} -t {threads} --split {params.SPLIT} -p1 ${{R1}} -p2 ${{R2}} -o {params.OUTDIR} 2> >(tee -a {log.errfile} >&2)
    """
143
144
145
146
147
148
149
150
151
shell:
    """
    # Convert BAM to FASTQ without re-reversing reads - SAVAGE expect all reads in the same direction
    source {params.FUNCTIONS}
    SamToFastq {params.PICARD} I={input} FASTQ={output.R1} 2> >(tee {log.errfile} >&2)

    R1=${{PWD}}/{output.R1}
    {params.SAVAGE} -t {threads} --split {params.SPLIT} -s ${{R1}} -o {params.OUTDIR} 2> >(tee -a {log.errfile} >&2)
    """
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
shell:
    """
    {params.SAMTOOLS} sort -n {input.fname_bam} -o {output.fname_sam} 2> >(tee {log.errfile} >&2)

    mkdir -p {output.OUTPREFIX}
    {params.PREDICTHAPLO} \
        --sam {output.fname_sam} \
        --reference {input.fname_ref} \
        --prefix {output.OUTPREFIX}/ \
        --have_true_haplotypes 0 \
        --min_length {params.read_min_length} \
        2> >(tee -a {log.errfile} >&2)

    # TODO: copy over actual haplotypes
    touch {output.fname_out}
    """
34
35
36
37
shell:
    """
    {params.ALN2BASECNT} --first "{params.ARRAYBASED}" --basecnt "{output.BASECNT}" --coverage "{output.COVERAGE}" --name "{params.NAME}" --stats "{output.STATS}" "{input.BAM}" > {log.outfile} 2> >(tee {log.errfile} >&2)
    """
SnakeMake From line 34 of rules/mafs.smk
56
57
58
run:
    with open(output[0], "w") as out:
        out.write("\n".join(input))
SnakeMake From line 56 of rules/mafs.smk
101
102
103
104
shell:
    """
    {params.GATHER_COVERAGE} --output {output.COVERAGE} --stats {output.COVSTATS} --threads {threads} @{input.COVLIST} > >(tee {log.outfile}) 2> >(tee {log.errfile} >&2)
    """
SnakeMake From line 101 of rules/mafs.smk
155
156
157
158
shell:
    """
    {params.MINORITY_CALLER} -r {input.REF} -c {params.MIN_COVERAGE} -N {params.NAMES} -t {threads} -o {params.OUTDIR} {params.FREQUENCIES} {input.BAM} > >(tee {log.outfile}) 2> >(tee {log.errfile} >&2)
    """
SnakeMake From line 155 of rules/mafs.smk
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
shell:
    """
    echo "Trimming BAM with ivar"

    # iVar will Segfault without this:
    mkdir -p "$(dirname {params.ivar_tmp}"")"
    {params.IVAR} trim -e -i {input.BAM} -b {params.BED_PRIMERS} -p {params.ivar_tmp} > {log.outfile} 2> {log.errfile}

    # samtools complains without that:
    rm -f '{params.sort_tmp}'.[0-9]*.bam
    {params.SAMTOOLS}  sort -o {output.BAM} -T {params.sort_tmp} {params.ivar_tmp}.bam 2> >(tee -a {log.errfile} >&2)
    {params.SAMTOOLS}  index {output.BAM} 2> >(tee -a {log.errfile} >&2)
    rm -f {params.ivar_tmp}.bam

    """
100
101
102
103
104
105
106
107
108
109
110
shell:
    """
    echo "Trimming BAM with samtools"

    # samtools complains without that:
    rm -f '{params.sort_tmp}'.[0-9]*.bam

    {params.SAMTOOLS} ampliconclip -b {params.BED_PRIMERS} -f {output.stats} {input.BAM} |
        {params.SAMTOOLS}  sort -o {output.BAM} -T {params.sort_tmp} 2> >(tee {log.errfile} >&2)
    {params.SAMTOOLS}  index {output.BAM} 2> >(tee -a {log.errfile} >&2)
    """
62
63
64
65
shell:
    """
    {params.script} {params.options} "{output.upload_prepared_touch}" "{params.sample_id}" "{wildcards.dataset}" {input:q}
    """
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
shell:
    """
    # using zcat FILENAME.gz causes issues on Mac, see
    # https://serverfault.com/questions/570024/
    # redirection fixes this:
    unpack_rawreads() {{
        for fq in "${{@}}"; do
            zcat -f < "${{fq}}"
        done
    }}

    echo "Compress un-filtered sequences -----------------------------------"
    echo

    {params.BWA} mem -t {threads} \
                     -C \
                     -o {output.cram_sam} \
                     {input.global_ref} \
                     <(unpack_rawreads {input.R1:q}) \
                     <(unpack_rawreads {input.R2:q}) \
                     > {log.outfile} 2> >(tee {log.errfile} >&2)

    # HACK handle incompatibilities between:
    #  - Illumina's 'bcl2fastq', which write arbitrary strings
    #  - 'bwa mem' which keep comments in the SAM file verbatim as in the FASTQ file
    #  - 'samtools' which expects comment to be properly marked as 'BC:Z:'
    #    as per SAM format specs
    REGEXP=\'s{{(?<=\\t)([[:digit:]]:[[:upper:]]:[[:digit:]]:([ATCGN]+(\+[ATCGN]+)?|[[:digit:]]+))$}}{{BC:Z:\\1}}\'
    FMT=cram,embed_ref,use_bzip2,use_lzma,level=9,seqs_per_slice=1000000

    rm -f '{params.sort_tmp}'.[0-9]*.bam
    perl -p -e ${{REGEXP}} {output.cram_sam} \
          | {params.SAMTOOLS} sort -@ {threads} \
                                   -T {params.sort_tmp} \
                                   -M \
                                   --reference {input.global_ref} \
                                   --output-fmt ${{FMT}} \
                                   -o {output.final_cram} \
                                   2> >(tee -a {log.errfile} >&2)

    {params.checksum_type}sum {output.final_cram} > {output.checksum} 2> >(tee -a {log.errfile} >&2)

    echo
    echo DONE -------------------------------------------------------------
    echo
    """
158
159
160
161
shell:
    """
    {params.checksum_type}sum {input} > {output}
    """
30
31
32
33
shell:
    """
    {params.GUNZIP} -c {input} > {output}
    """
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
shell:
    """
    echo "The length cutoff is: {params.LEN_CUTOFF}" > {log.outfile}

    {params.PRINSEQ} -fastq {input.R1} -fastq2 {input.R2} {params.EXTRA} -out_format 3 -out_good {wildcards.dataset}/preprocessed_data/R -out_bad null -min_len {params.LEN_CUTOFF} -log {log.outfile} 2> >(tee {log.errfile} >&2)

    # make sure that the lock held prinseq has been effectively released and propagated
    # on some networked shares this could otherwise lead to confusion or corruption
    if [[ "$OSTYPE" =~ ^linux ]]; then
        echo "Waiting for unlocks" >&2
        for U in {wildcards.dataset}/preprocessed_data/R_{{1,2}}.fastq; do
            flock -x -o ${{U}} -c "echo ${{U}} unlocked >&2"
        done
    fi

    mv {wildcards.dataset}/preprocessed_data/R{{_,}}1.fastq
    mv {wildcards.dataset}/preprocessed_data/R{{_,}}2.fastq
    rm -f {wildcards.dataset}/preprocessed_data/R_?_singletons.fastq

    gzip {wildcards.dataset}/preprocessed_data/R1.fastq
    gzip {wildcards.dataset}/preprocessed_data/R2.fastq
    """
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
shell:
    """
    echo "The length cutoff is: {params.LEN_CUTOFF}" > {log.outfile}

    {params.PRINSEQ} -fastq {input.R1} {params.EXTRA} -out_format 3 -out_good {wildcards.dataset}/preprocessed_data/R -out_bad null -min_len {params.LEN_CUTOFF} {params.EXTRA} -log {log.outfile} 2> >(tee {log.errfile} >&2)

    # make sure that the lock held prinseq has been effectively released and propagated
    # on some network shares this could otherwise lead to confusion or corruption
    if [[ "$OSTYPE" =~ ^linux ]]; then
        echo "Waiting for unlocks" >&2
        for U in {wildcards.dataset}/preprocessed_data/R_{{1,2}}.fastq; do
            flock -x -o ${{U}} -c "echo ${{U}} unlocked >&2"
        done
    fi

    mv {wildcards.dataset}/preprocessed_data/R{{,1}}.fastq

    gzip {wildcards.dataset}/preprocessed_data/R1.fastq
    """
187
188
189
190
shell:
    """
    {params.FASTQC} -o {params.OUTDIR} -t {threads} {params.NOGROUP} {input} 2> >(tee {log.errfile} >&2)
    """
54
55
56
57
58
59
60
61
62
63
64
65
66
shell:
    """
    TMPTSV=$(mktemp -t XXXXXXXX_cov.tsv)
    TMPINT=$(mktemp -t XXXXXXXX_int.tsv)
    {params.GUNZIP} -c {input.TSV} | cut -f'2-' > $TMPTSV
    mkdir -p "$(dirname "{output}")"
    {params.EXTRACT_COVERAGE_INTERVALS} -c "{params.COVERAGE}" -w "{params.WINDOW_LEN}" -s "{params.SHIFT}" -N "{params.NAME}" {params.LIBERAL} -b {params.ARRAYBASED} {params.OVERLAP} -t "{threads}" -o $TMPINT "{input.BAM}" > >(tee {log.outfile}) 2>&1
    cat $TMPINT >> "{log.outfile}"
    read name intervals < $TMPINT
    IFS=',' read -r -a interarray <<< "$intervals"
    printf "%s\n" "${{interarray[@]}}" > "{output}"
    rm $TMPTSV $TMPINT
    """
SnakeMake From line 54 of rules/snv.smk
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
shell:
    """
    let "WINDOW_SHIFTS=({params.READ_LEN} * 4/5 + {params.SHIFT}) / {params.SHIFT}"
    let "WINDOW_LEN=WINDOW_SHIFTS * {params.SHIFT}"

    echo "Windows are shifted by: ${{WINDOW_SHIFTS}} bp" > {log.outfile}
    echo "The window length is: ${{WINDOW_LEN}} bp" >> {log.outfile}

    # Get absolute path for input files
    CWD=${{PWD}}
    BAM=${{PWD}}/{input.BAM}
    REF={input.REF}; [[ ${{REF}} =~ ^/ ]] || REF=${{PWD}}/${{REF}}
    OUTFILE=${{PWD}}/{log.outfile}
    ERRFILE=${{PWD}}/{log.errfile}
    WORK_DIR=${{PWD}}/{params.WORK_DIR}

    if [[ -n "{params.LOCALSCRATCH}" ]]; then
        # put input files in localscratch
        rsync -aq "${{BAM}}" "${{REF}}" "{params.LOCALSCRATCH}/"
        BAM="{params.LOCALSCRATCH}/${{BAM##*/}}"
        REF="{params.LOCALSCRATCH}/${{REF##*/}}"
    fi

    # Run ShoRAH in each of the predetermined regions (regions with sufficient coverage)
    LINE_COUNTER=0
    FILES=( )
    FILES_VCF=( )
    while read -r region || [[ -n ${{region}} ]]
    do
        echo "Running ShoRAH on region: ${{region}}" >> $OUTFILE
        (( ++LINE_COUNTER ))
        # Create directory for running ShoRAH in a corresponding region (if doesn't exist)
        DIR=${{WORK_DIR}}/REGION_${{LINE_COUNTER}}
        if [[ ! -d "${{DIR}}" ]]; then
            echo "Creating directory ${{DIR}}" >> $OUTFILE
            mkdir -p ${{DIR}}
        else
            # Results from previous runs
            if [[ {params.KEEP_FILES} == "true" ]]; then
                DIR_DST=${{WORK_DIR}}/old
                echo "Moving results from a previous run to ${{DIR_DST}}" >> $OUTFILE
                rm -rf ${{DIR_DST}}/REGION_${{LINE_COUNTER}}
                mkdir -p ${{DIR_DST}}
                mv -f ${{DIR}} ${{DIR_DST}}
                mkdir -p ${{DIR}}
            fi
        fi
        # Change to the directory where ShoRAH is to be executed
        if [[ -z "{params.LOCALSCRATCH}" ]]; then
            cd ${{DIR}}
        else
            # special case: go in local scratch instead
            rsync -aq "${{DIR}}" "{params.LOCALSCRATCH}/"
            mkdir -p "{params.LOCALSCRATCH}/REGION_${{LINE_COUNTER}}"
            cd "{params.LOCALSCRATCH}/REGION_${{LINE_COUNTER}}"
        fi

        # NOTE: Execution command for ShoRAH2 valid from v1.99.0 and above
        {params.SHORAH} -t {threads} -a {params.ALPHA} -w ${{WINDOW_LEN}} -x 100000 {params.IGNORE_INDELS} -p {params.POSTHRESH} -c {params.COVERAGE} -r ${{region}} -R 42 -b ${{BAM}} -f ${{REF}} >> $OUTFILE 2> >(tee -a $ERRFILE >&2)
        if [[ -n "{params.LOCALSCRATCH}" ]]; then
            # copyback from localscratch
            rsync -auq "{params.LOCALSCRATCH}/REGION_${{LINE_COUNTER}}" "${{WORK_DIR}}"
            cd ${{DIR}}
        fi
        if [[ -s ${{DIR}}/reads.fas && -f ${{DIR}}/snv/SNVs_0.010000_final.csv ]]; then
            # Non empty reads: run had enough data and should have produced SNVs
            {params.BCFTOOLS} view ${{DIR}}/snv/SNVs_0.010000_final.vcf -Oz -o ${{DIR}}/snv/SNVs_0.010000_final.vcf.gz
            {params.BCFTOOLS} index ${{DIR}}/snv/SNVs_0.010000_final.vcf.gz
            FILES+=("${{DIR}}/snv/SNVs_0.010000_final.csv")
            FILES_VCF+=("${{DIR}}/snv/SNVs_0.010000_final.vcf.gz")
        elif (( {params.COVINT} == 0 && LINE_COUNTER == 1 )) && [[ -f ${{DIR}}/reads.fas && ( ! -s ${{DIR}}/reads.fas ) ]]; then
            # if we have disabled coverage intervales entirely, the first and only line might have no reads
            # (e.g.: in negative controls )

            echo "No reads while coverage intervals disabled (possible negative control sample)" 2> >(tee -a $ERRFILE >&2)
            cd ${{CWD}}
            (( --LINE_COUNTER )) || true # Strict mode : (( 0 )) = fail
            break
        else
            echo "ERROR: unsuccesful execution of ShoRAH" 2> >(tee -a $ERRFILE >&2)
            exit 1
        fi

        # Change back to working directory
        cd ${{CWD}}
    done < {input.TSV}

    # Aggregate results from different regions
    if (( ${{#FILES[@]}} )); then
        echo "Intermediate csv files: ${{FILES[*]}}" >> {log.outfile}
        echo "Intermediate vcf files: ${{FILES_VCF[*]}}" >> {log.outfile}
        (head -n 1 "${{FILES[0]}}"; tail -q -n +2 "${{FILES[@]}}" | sort -t, -nk2) > {output.CSV}
        {params.BCFTOOLS} concat -o ${{WORK_DIR}}/snvs_tmp.vcf "${{FILES_VCF[@]}}"
        {params.BCFTOOLS} sort ${{WORK_DIR}}/snvs_tmp.vcf  -o {output.VCF}
        rm -f ${{WORK_DIR}}/snvs_tmp.vcf
    elif (( LINE_COUNTER )); then
        echo "ERROR: unsuccesful execution of ShoRAH" 2> >(tee -a {log.errfile} >&2)
        exit 1
    else
        echo "No alignment region reports sufficient coverage" >> {log.outfile}
        touch {output.CSV}
        touch {output.VCF}
    fi
    """
SnakeMake From line 118 of rules/snv.smk
242
243
244
245
shell:
    """
    {params.SAMTOOLS} faidx {input} -o {output} > {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
SnakeMake From line 242 of rules/snv.smk
288
289
290
291
292
293
294
295
296
297
298
shell:
    """
    # Add qualities to indels
    {params.LOFREQ} indelqual --dindel -f {input.REF} -o {output.BAM} --verbose {input.BAM} > {log.outfile} 2> >(tee {log.errfile} >&2)
    # Index bam file
    {params.SAMTOOLS} index {output.BAM} 2> >(tee {log.errfile} >&2)

    # Run Lofreq
    echo "Running LoFreq" >> {log.outfile}
    {params.LOFREQ} call {params.EXTRA} --call-indels -f {input.REF} -o {output.SNVs} --verbose {output.BAM} >> {log.outfile} 2> >(tee -a {log.errfile} >&2)
    """
SnakeMake From line 288 of rules/snv.smk
42
43
44
45
shell:
    """
    {params.EXTRACT_COVERAGE_INTERVALS} -b {params.ARRAYBASED} -cf {input.TSV} -c {params.COVERAGE} --no-shorah -N {params.NAMES} -o {output} {input.BAM} > >(tee {log.outfile}) 2>&1
    """
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
shell:
    """
    SAMPLE_ID="{params.ID}"

    # Number of input reads
    LINECOUNT=$( cat {input.R1} | wc -l )
    let "INPUT=LINECOUNT / {params.FACTOR}"

    # Number of reads after QC
    # For portability reason not using zcat
    {params.GUNZIP} -c {input.R1_QC} > {params.R1_temp}
    LINECOUNT=$( cat {params.R1_temp} | wc -l )
    let "READCOUNT=LINECOUNT / {params.FACTOR}"
    rm {params.R1_temp}

    # Number of aligned reads
    ALNCOUNT=$( {params.SAMTOOLS} view {input.BAM} | wc -l )

    echo -e "${{SAMPLE_ID}}\t${{INPUT}}\t${{READCOUNT}}\t${{ALNCOUNT}}" > {output}
    """
91
92
93
94
95
96
shell:
    """
    cat {input} > stats/temp
    echo -e "ID\tInput\tQC\tAlignments" | cat - stats/temp > {output}
    rm stats/temp
    """
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
shell:
    """
    # Why a shell directive?
    # 1) script directive crashes with `VpipeConfig`
    # 2) run directive does not allow conda envs

    if [ ! -z "{params.nwk_file}" ]
    then
      # generate phylogenetic tree
      augur align --sequences {input.phylogeny_data} {input.consensus_file} --output {params.alignment_file}
      augur tree --alignment {params.alignment_file} --output {params.nwk_file}
    fi

    # generate read alignment
    create_datauri {input.reference_file} > {output.reference_uri_file}
    create_datauri {input.bam_file} > {output.bam_uri_file}

    # generate html visualization pages
    python "{params.assemble_visualization_webpage}" \
        --consensus    "{input.consensus_file}" \
        --coverage    "{input.coverage_file}" \
        --tsvbased "{params.tsvbased}" \
        --vcf    "{input.vcf_file}" \
        --gff    "{input.gff_directory}" \
        --primers    "{input.primers_file}" \
        --metainfo    "{input.metainfo_file}" \
        --snv_calling_template    "{params.snv_visualization_template}" \
        --alignment_template    "{params.alignment_visualization_template}" \
        --html_out_snv_calling    "{output.snv_html_file}" \
        --html_out_alignment    "{output.alignment_html_file}" \
        --wildcards    "{wildcards.dataset}" \
        --reference    "{input.global_ref}" \
        --reference_uri_file "{output.reference_uri_file}" \
        --bam_uri_file "{output.bam_uri_file}" \
        --nwk "{params.nwk_file}" \
        > {log.outfile} 2> {log.errfile}

    """
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import pandas as pd


def main(in_fnames_diversity, in_fnames_shannon, out_diversity_csv, out_shannon_csv):
    merged_div_csv = pd.concat(
        [pd.read_csv(path_div) for path_div in in_fnames_diversity]
    )
    merged_div_csv.to_csv(out_diversity_csv)

    merged_shan_csv = pd.concat(
        [pd.read_csv(path_shan) for path_shan in in_fnames_shannon]
    )
    merged_shan_csv.to_csv(out_shannon_csv)


if __name__ == "__main__":
    main(
        snakemake.input.fnames_diversity,
        snakemake.input.fnames_shannon,
        snakemake.output.diversity_csv,
        snakemake.output.shannon_csv,
    )
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
import sys
import os
import vcf
import pandas as pd
import numpy as np
import skbio
from scipy.stats import sem


def number_of_polymorphisms(df_mutations, minor_allele_frequency=0):
    df_temp = df_mutations[df_mutations["frequency"] >= minor_allele_frequency]
    variant_positions = df_temp["position"].unique()
    return len(variant_positions)


def list_polymorphic_sites(df_mutations, minor_allele_frequency=0):
    df_temp = df_mutations[df_mutations["frequency"] >= minor_allele_frequency]
    variant_positions = df_temp["position"].unique()
    return variant_positions


def number_of_mutations(df_mutations, minimum_frequency=0):
    df_temp = df_mutations[df_mutations["frequency"] >= minimum_frequency]
    n_reads_with_mut = df_temp["frequency"] * df_temp["coverage"]
    n_reads_with_mut = df_temp["rvar"] + df_temp["fvar"]
    return n_reads_with_mut.sum()


def mutation_spectrum(df_mutations, bins):
    counts = np.zeros(len(bins))
    for i in range(len(bins) - 1):
        df_temp = df_mutations[df_mutations["frequency"] > bins[i]]
        df_temp = df_temp[df_temp["frequency"] <= bins[i + 1]]
        n_reads_with_mut = df_temp["frequency"] * df_temp["coverage"]
        counts[i + 1] = n_reads_with_mut.sum()
    return list(counts)


def population_nucleotide_diversity(df_mutations, length):
    # only the positions with mutations are needed
    pi = 0
    for position_temp in list_polymorphic_sites(df_mutations, minor_allele_frequency=0):
        df_temp = df_mutations[df_mutations["position"] == position_temp]
        N = df_temp["coverage"].unique()[0]
        if N == 0:
            continue
        freq = df_temp["frequency"].to_numpy()
        ref_freq = 1 - freq.sum()

        position_pnd = freq**2
        postion_pi = (1 - (position_pnd.sum() + ref_freq**2)) * N / (N - 1)

        pi += postion_pi

    return float(pi / length)


def position_Shannon_entropy(df_mutations, position):
    df_temp = df_mutations[df_mutations["position"] == position]
    position_shannon = 0
    sum_fraction = 0
    var_shannon = df_temp["frequency"].apply(lambda x: x * np.log(x) if x > 0 else 0)

    sum_fraction = df_temp["frequency"].sum()
    position_shannon = var_shannon.sum()

    # add the reference base summand
    if 1 - sum_fraction > 0:
        position_shannon += (1 - sum_fraction) * np.log(1 - sum_fraction)

    return -position_shannon


def mean_pos_Shannon_entropy(df_mutations, length):
    entropy = 0
    for position_temp in list_polymorphic_sites(df_mutations, minor_allele_frequency=0):
        entropy += position_Shannon_entropy(df_mutations, position_temp)
    return entropy / length


def convert_vcf(fname):
    """Convert VCF to JSON."""
    output = []

    if os.path.getsize(fname) == 0:
        print(f'Empty VCF: "{fname}"')
        return output

    print(f'Parsing VCF: "{fname}"')
    with open(fname) as fd:
        vcf_reader = vcf.Reader(fd)

        # check caller
        caller_source = vcf_reader.metadata["source"][0].lower()
        if caller_source.startswith("lofreq"):
            mode = "lofreq"
        elif caller_source.startswith("shorah"):
            mode = "shorah"
        else:
            raise RuntimeError(f"Invalid variant caller: {caller_source}")

        # output dataframe
        cols_snv = [
            "position",
            "reference",
            "variant",
            "frequency",
            "tvar",
            "fvar",
            "rvar",
            "ftot",
            "rtot",
            "coverage",
        ]
        df_snv = pd.DataFrame(columns=cols_snv)

        # parse records
        for record in vcf_reader:
            if mode == "lofreq":
                freq = round(record.INFO["AF"], 3)
                fvar = record.INFO["DP4"][2]  # counts variant forward reads
                rvar = record.INFO["DP4"][3]  # counts variant reverse reads
                rtot = (
                    record.INFO["DP4"][1] + record.INFO["DP4"][3]
                )  # counts reverse reads
                ftot = (
                    record.INFO["DP4"][0] + record.INFO["DP4"][2]
                )  # counts forward reads

            elif mode == "shorah":
                freq = round(
                    np.mean(
                        [v for k, v in record.INFO.items() if k.startswith("Freq")]
                    ),
                    3,
                )
                fvar = record.INFO["Fvar"]  # counts variant forward reads
                rvar = record.INFO["Rvar"]  # counts variant reverse reads
                rtot = record.INFO["Rtot"]  # counts reference reverse reads
                ftot = record.INFO["Ftot"]  # counts reference forward reads

            df_snv = df_snv.append(
                {
                    "position": record.POS,
                    "reference": record.REF,
                    "variant": [v.sequence for v in record.ALT],
                    "frequency": freq,
                    "tvar": fvar + rvar,
                    "fvar": fvar,
                    "rvar": rvar,
                    "ftot": ftot,
                    "rtot": rtot,
                    "coverage": ftot + rtot,
                },
                ignore_index=True,
            )

        id = record.CHROM
    return df_snv, id


def load_reference_seq(reference_file):
    for seq in skbio.io.read(reference_file, format="fasta"):
        return seq


def main(fname_snv_in, fname_reference, output_diversity_csv, output_shannon_csv):
    """
    Compute various diversity indices for underlying sample.
    Writes a csv-file with computations.

    Parameters
    ----------
    fname_snv_in:
        Absolute path to snv.vcf created by lofreq or ShoRAH.
    ref_seq_length:
        Length of the reference sequence.
    output_diversity_csv:
        Absolute path to diversity-csv.
    output_shannon_csv:
        Absolute path to position-wise Shannon-entropy csv.

    Output
    ------
    diversity_measures.csv:
        Listing all computed diversity measures.
    position_shannon_entropy.csv:
        Position-wise Shannon entropy, if non-zero.
    """
    # get length of reference sequence
    ref_seq_length = len(load_reference_seq(fname_reference))

    # Parse snv.vcf
    df_snv, id = convert_vcf(fname_snv_in)

    # Sample, patient, date information
    general_info = {
        "sample": fname_snv_in.split("/variants")[0].split("/")[-3],
        "patient": fname_snv_in.split("/variants")[0].split("/")[-2],
        "date": fname_snv_in.split("/variants")[0].split("/")[-1],
    }

    # prepare dict collecting all the diversity measures
    out_dict = {"id": id, "length": ref_seq_length}

    # number of mutations with different minor allele frequency
    out_dict.update(
        {
            "n_mutations_minFrq_0": number_of_mutations(df_snv, minimum_frequency=0),
            "n_mutations_minFrq_1": number_of_mutations(df_snv, minimum_frequency=0.01),
            "n_mutations_minFrq_5": number_of_mutations(df_snv, minimum_frequency=0.05),
        }
    )

    # sum mutation frequencies
    out_dict.update({"sum_mutation_frq": df_snv["frequency"].sum()})

    # mean mutation frequencies
    out_dict.update({"mean_mutation_frq": df_snv["frequency"].mean()})

    # the standard error of the mean (SEM) mutation frequency
    out_dict.update({"sem_mutation_frq": sem(df_snv["frequency"].to_numpy())})

    # population nucleotide diversity
    out_dict.update(
        {
            "population_nucleotide_diverstiy": population_nucleotide_diversity(
                df_snv, ref_seq_length
            )
        }
    )

    # mean position-wise Shannon entropy
    out_dict.update(
        {"mean_position_shannon": mean_pos_Shannon_entropy(df_snv, ref_seq_length)}
    )

    # mutation spectrum
    bins = np.arange(0, 1, 0.05)
    out_dict.update({"mutation_spectrum_bins": list(bins)})
    out_dict.update({"mutation_spectrum": mutation_spectrum(df_snv, bins)})

    out_dict.update(general_info)

    # save to dataframe
    df_diveristy = pd.DataFrame(columns=list(out_dict.keys()))
    df_diveristy = df_diveristy.append(out_dict, ignore_index=True)
    df_diveristy.to_csv(output_diversity_csv, index=False)

    # position-wise Shannon entropy
    pos_shannon_dict = general_info
    for i in range(ref_seq_length):
        if position_Shannon_entropy(df_snv, i) != 0:
            pos_shannon_dict.update({i: position_Shannon_entropy(df_snv, i)})
    df_pos_shannon = pd.DataFrame(columns=list(pos_shannon_dict.keys()))
    df_pos_shannon = df_pos_shannon.append(pos_shannon_dict, ignore_index=True)
    df_pos_shannon.to_csv(output_shannon_csv, index=False)


if __name__ == "__main__":
    main(
        snakemake.input.fnames_samples_snvs_vcf,
        snakemake.input.fname_ref,
        snakemake.output.diversity_csv,
        snakemake.output.shannon_csv,
    )
ShowHide 61 more snippets with no or duplicated tags.

Free

Created: 7mo ago
Updated: 7mo ago
Maitainers: public
URL: https://cbg-ethz.github.io/V-pipe/
Name: v-pipe
Version: v2.99.3
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Accessed: 6
Downloaded: 0
Copyright: Public Domain
License: Apache License 2.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...